Comprehensive curation and validation of genomic datasets for chestnut

Fan, Jialu; Zhang, Yu; Nie, Xinghua; Liu, Yang; Wei, Shangxiao; Peng, Haixu; Li, Hanlei; Zhang, Mingjun; Ning, Lu; Wang, Sen; Qin, Ling; Zheng, Yi; Xing, Yu

doi:10.1038/s41597-025-05162-x

Download PDF

Data Descriptor
Open access
Published: 24 May 2025

Comprehensive curation and validation of genomic datasets for chestnut

Jialu Fan^1,2^na1,
Yu Zhang ORCID: orcid.org/0009-0005-6907-7393^1,3^na1,
Xinghua Nie^1,3,
Yang Liu^1,3,
Shangxiao Wei^1,2,
Haixu Peng^1,2,
Hanlei Li^1,2,
Mingjun Zhang^1,2,
Lu Ning⁴,
Sen Wang ORCID: orcid.org/0000-0002-3497-3068^1,2,5,
Ling Qin^1,3,
Yi Zheng ORCID: orcid.org/0000-0002-8042-7770^1,2,5 &
…
Yu Xing^1,3

Scientific Data volume 12, Article number: 860 (2025) Cite this article

2860 Accesses
2 Citations
Metrics details

Subjects

Abstract

The Chinese chestnut (Castanea mollissima) stands out as a plant with significant ecological and economic value, excellent nutritional quality and natural resistance to pests and diseases. Recent strides in high-throughput techniques have enabled the continuous accumulation of genomic data on chestnuts, presenting a promising future for genetic research and advancing traits in this species. To facilitate the accessibility and utility of this data, we have curated and analyzed a collection of genomic datasets for eight Castanea species, including functional annotations, 213 RNA-Seq samples, and 330 resequencing samples. These datasets are publicly available on Figshare and are also available through other platforms such as GEO and EVA, providing a valuable resource for researchers studying Castanea genetics, functional genomics, and evolutionary biology. Furthermore, the datasets are integrated into the Castanea Genome Database (CGD, http://castaneadb.net), which serves as a complementary platform, offering advanced data mining and analysis tools, including BLAST, Batch Query, GO/KEGG Enrichment Analysis, and Synteny Viewer, to enhance the usability of the curated datasets.

Chromosome-level genome assembly of Oriental chestnut gall wasp (Dryocosmus kuriphilus)

Article Open access 04 September 2024

Cyto-molecular characterization of rDNA and chromatin composition in the NOR-associated satellite in Chestnut (Castanea spp.)

Article Open access 15 January 2024

Representation and participation across 20 years of plant genome sequencing

Article Open access 29 November 2021

Background & Summary

Castanea is a genus of plants in the family Fagaceae that has significant ecological and economic value¹, including excellent nutritional quality². The genus comprises seven species, including the Chinese chestnut (Castanea mollissima), Seguin chestnut (Castanea seguinii), Henry chestnut (Castanea henryi), and Japanese chestnut (Castanea crenata) in East Asia, the American chestnut (Castanea dentata) and Chinkapin (Castanea pumila) are present in North America, while European chestnut (Castanea sativa) is grown in Europe. Chestnuts play an important role in nut production and forest ecosystem services³. Chinese chestnuts are widely cultivated in 26 provinces of China⁴, with a nut yield of 1,562,685 tons, accounting for 74% of the global chestnut production, ranking first in the world in 2022⁵.

Recent advancements in high-throughput sequencing technologies have enabled the continuous accumulation of genomic data for Castanea species. However, existing resources are relatively scattered and lack integration, limiting researchers’ ability to conduct in-depth studies on Castanea genetics, functional genomics, and evolutionary biology.

To address this gap, we have gathered and analyzed a comprehensive collection of genomic datasets for Castanea. The datasets include genomic information from eight Castanea species, 213 RNA-Seq samples, and 330 resequencing samples and are publicly available on figshare. This curation and validation process ensures the reliability and utility of the data for researchers.

Additionally, the CGD serves as a complementary platform to enhance the usability of these datasets. The CGD provides a user-friendly interface and a suite of advanced data mining and analysis tools, including BLAST, Batch Query, GO/KEGG Enrichment Analysis, and Synteny Viewer. These tools facilitate the exploration and analysis of the curated datasets, enabling researchers to investigate deeper into Castanea genetics and functional genomics.

We envision the CGD as an essential resource for advancing functional genomic research and understanding the evolutionary relationships within the Castanea genus. By making these datasets and tools readily accessible, we aim to facilitate collaborative research and drive innovation in the field.

Methods

Data collection

The CGD contains eight chestnut genomes, including seven from C. mollissima (‘HBY-2’^6,7, ‘N11-1’⁸, ‘Sun’⁹, ‘Vanuxem’¹⁰, ‘drought-resistant’(H7)¹¹, ‘early-maturing’(ZS)¹¹ and ‘easy-pruning’(YH)¹¹) and one from C. crenata¹² (Table 1). Genomic information for chestnut, including the genome sequences, mRNA, gene structure annotations in general feature format (GFF), coding (CDS) and protein sequences of protein-coding genes, were obtained based on previously published articles. We collected the genome of ‘Vanuxem’ from the Hardwood Genomics website (now relocated to the TreeGenes Database, https://treegenesdb.org/), which includes several transcriptome libraries from trees affected by chestnut blight disease^10,13. The TreeGenes Database also contains data for eight chestnut species. However, it is not specific to chestnut, it includes data for multiple tree species. Noted that the C. mollissima cv. HBY-2 we previously generated using Pacific Biosciences single-molecule sequencing technology⁶. We performed Hi-C analysis on the Chinese chestnut genome HBY-2 to enhance the genome sequence assembly and contiguity⁷. Therefore, this version exhibits superior genome contiguity and sequence quality compared to the previous version. Besides, RNA-Seq data from various samples, which are at different tissues, developmental stages, as well as cultivars (Supplementary Table 1) were downloaded from the NCBI SRA databese. And resequencing datasets from 330 accessions were downloaded from the NCBI database^7,9.

Table 1 List of collected Castanea genomes.

Full size table

Data processing

Gene functional annotation

In our previous study, we developed a standard pipeline for comprehensively annotating predicted protein-coding genes¹⁴. In brief, the protein sequences of the predicted genes were analyzed against the UniPort (Swiss-Port and TrEMBL), and NCBI nonredundant (nr), and Arabidopsis protein (TAIR) databases using DIAMOND¹⁵ with an E-value cutoff of 1e-4. Furthermore, all the protein sequences were compared against the InterPro database to identify functional domains by InterProScan¹⁶. For the purpose of conducting functional enrichment analyses and generating GO and KEGG pathway annotations, protein sequences were aligned with the EggNOG database using eggnog-mapper¹⁷. The GO terms assigned to genes/transcripts based on the eggnog-mapper results were transformed into the GO Annotation File (GAF) format. Within the eggnog-mapper outcomes, certain KEGG pathways unrelated to plants were discarded. The iTAK program was employed to identify transcription factors (TFs), transcriptional regulators (TRs), and protein kinases(PKs) from the predicted protein-coding genes and to classify them into different families¹⁸.

After the above analysis, we obtained files containing homologous genes identified by BLAST, protein functional domains, AHRD-based functional descriptions, and GO/KEGG annotations, which have been shared in our database and on Figshare.

RNA-Seq analysis

A specific pipeline was utilized to procedure and analyze hundreds of RNA-Seq datasets, which were downloaded from the NCBI SRA Database. Firstly, raw RNA-Seq reads were processed using the FastQC software (v0.11.9)¹⁹ to evaluate the quality of reads, then treated to remove adaptor and low-quality sequences using Trimmomatic²⁰, Trimmed reads shorter than 80% of their initial length were removed, and then the remaining cleaned reads were aligned to the reference genome (HBY-2) using the STAR (version 2.7.10b)²¹. Finally, read counts for each gene were calculated based on the alignments and normalized to fragments per kilobase of transcript per million mapped fragments (FPKM) values²².

Following the analysis, we generated the raw and normalized expression matrices and made them available both in our database and on Figshare.

Variant identification

To identify variants, a pipeline developed by Sentieon Inc.²³ was used with default parameters. First, quality evaluation utilizing the FastQC software (v0.11.9)¹⁹, and resequencing reads were processed to remove adapter and low-quality sequences using Trimmomatic²⁰. Then, the cleaned reads were aligned to the reference genomes (HBY-2) utilizing the BWA-MEM algorithm with default parameters²⁴. Following alignment, duplicated reads were removed using the ‘LocusCollector’ and ‘Dedup’ algorithms of Sentieon. Finally, variants were called using the ‘Haplotyper’ algorithms of Sentieon software (https://www.sentieon.com/).

As a result of the analysis, we produced the vcf files and have shared them through our database and Figshare.

Data Records

All functional annotations, expression profiles, and called variant data have been uploaded to Figshare (https://figshare.com/s/d8505ab07724a111b1f3)²⁵, where researchers can access the required files. Additionally, the expression profiles are available on the Gene Expression Omnibus (GEO) under accession numbers GSE284510²⁶, GSE284516²⁷, GSE284517²⁸, and GSE284518²⁹. The variant data are accessible through the European Variation Archive (EVA)³⁰. On Figshare, we have organized the data into three main categories: genomic data, gene expression data, and variant data.

The genomic data category includes comprehensive functional annotations for all collected genes. These annotations cover homologous genes identified by BLAST, protein functional domains, AHRD-based functional descriptions, and GO/KEGG annotations. The files are stored in a compressed archive named “genome_anno.tgz,” with subfolders named according to the respective genomes. Additionally, the genome files for HBY-2 have also been uploaded in this section. The gene expression data category consists of both the raw and normalized expression matrices. These files are stored in a compressed archive named “gene_expression.zip.” The variant data category includes VCF files for variant data, all starting with “variants_data.” It encompasses 330 samples from the GBS (Genotyping-by-Sequencing) and resequencing projects.

Additionally, all genomic sequences, functional annotations, expression profiles, and called variant data can be downloaded from the CGD under the “Tools - > Download” module. The sample information for the hierarchical clustering heatmap can be found in Supplementary Table 1. Notably, the updated data for HBY-2 has also been shared on Figshare (https://figshare.com/articles/dataset/Wild_Chinese_chestnut_genome_V2_/28098758).

By organizing the data in this structured manner, we aim to facilitate easy access and usability for researchers interested in chestnut genomics and related studies.

Technical Validation

The integrity of genomes

We evaluated genome integrity using BUSCO (version 3)³¹, the BUSCO analysis is based on conserved orthologous genes among species. In this study, simultaneous analysis of the genomes of eight varieties was conducted, and the proportions of identified complete genes in the BUSCO database (with a total of 1,614 genes) were 98.0%, 97.6%, 94.0%, 98.5%, 95.6%, 94.3%, 98.5%, and 92.6%, respectively (Table 2), the results indicate that the genomes integrity are satisfactory.

Table 2 BUSCO assessment of the completeness of genomes.

Full size table

RNA-Seq data

Among the 213 samples, 91.81% of the clean reads were mapped to the reference genome (HBY-2). To ensure the accurate reflection of expression across diverse tissues and developmental stages, hierarchical clustering analysis (Fig. 1) based on FPKMs was conducted using R, derived from various RNA-seq items of chestnut (Supplementary Table 1). The heatmap is divided into four color blocks from top to bottom: the first block represents seed kernel and embryo at relatively early stages, the second block represents leaves, buds, galled leaves, etc., at similar stages, the third block indicates seed kernel and embryo at relatively later stages, and the fourth block includes somatic embryo, embryo, root, and callus at similar stages. Among the 213 samples shown in the heatmap, nine are not clustered within the expected blocks (marked with red horizontal lines in the figure). Overall, similarly developmental stages originating from the same tissue exhibited cohesive clustering patterns. For example, the seed kernel is similar to the tissue of the embryo, and at similar developmental stages, they exhibit cohesive clustering patterns. The same reasoning applies to galled leaves, leaf, insect galls, and some of the tissue of the bud, which indicates the reliability of the data we collected.

Resequencing data

To verify the accuracy of Sentieon in detecting varinsts, several samples were randomly selected from the resequencing data and GATK pipeline was used to identify SNPs on them. Briefly, the cleaned reads were aligned to the reference genomes (HBY-2) using the BWA-MEM algorithm with default parameters²⁴. Next, the aligned reads were processed to remove duplicated reads using the MarkDuplicates algorithm from the Picard. After mapping, 93.81% of clean reads were mapped to the reference genome (HBY-2). The variant were called using the ‘HaplotypeCaller’ algorithms of GATK3.8, cross-validated with Sentieon, yielding results of 12,631,432 and 12,314,460 respectively (Table 3). The concordance rate of common SNPs identified by Sentieon and GATK3.8 software packages stands at 99.943% and 97.435%, respectively, indicate a high level of consistency in the results.

Table 3 Statistics of SNPs identified by Sentieon and GATK.

Full size table

Usage Notes

The CGD serves as a complementary platform to primary datasets hosted on Figshare and other repositories, enhancing data accessibility and utility. The following sections provide a concise overview of how the CGD enables efficient analysis and exploration of the aforementioned datasets.

Each gene feature page in the CGD offers basic gene details, an interactive genome browser, and access to mRNA, protein, and genomic sequences, along with homologous genes, functional annotations, protein domains, and syntenic blocks (Fig. 2). To facilitate dataset storage, browsing, and querying, we have implemented a versatile gene search tool that supports flexible queries based on selected genomes from a drop-down menu. Users can quickly locate specific genes of interest by entering keywords such as gene names, functional descriptions, transcription factor or protein kinase family names, or GO/Pfam terms. The search interface is user-friendly, featuring auto-suggestions to streamline the selection process (Fig. 3A).

The CGD also offers a suite of advanced data mining and analysis tools to maximize the utility of our datasets for comparative genomics, gene function discovery, and molecular breeding. For enrichment analysis, BLAST-indexed databases are organized into nucleotide and protein categories. Nucleotide databases include indexes for genomic, mRNA, and CDS sequences, while protein databases cover all available protein sequence indexes (Fig. 3B).

Furthermore, we have introduced the Synteny Viewer, a tool designed to simplify the identification and visualization of homologs within specific regions of the Castanea genome (Fig. 3C).

For more information, please visit the CGD at http://castaneadb.net. Detailed descriptions of the site’s features and functionalities are available directly on the platform.

Code availability

The gene functional annotation pipeline is available on GitHub (https://github.com/fjl23/CGD_annotation-pipeline). All software and pipelines were executed according to the manual and protocols of the published bioinformatics tools. The version and code/parameters of software have been detailed and described in Methods.

References

Bounous, G. The chestnut: a multipurpose resource for the new millennium. III International Chestnut Congress 693, 33–40 (2005).
Google Scholar
Li, Q. et al. Effect of cooking methods on nutritional quality and volatile compounds of Chinese chestnut (Castanea mollissima Blume). Food Chemistry 201, 80–86 (2016).
Article PubMed CAS Google Scholar
Ji, F. et al. Construction of a SNP-Based High-Density Genetic Map Using Genotyping by Sequencing (GBS) and QTL Analysis of Nut Traits in Chinese Chestnut (Castanea mollissima Blume). Frontiers in plant science 9, 816 (2018).
Article PubMed PubMed Central Google Scholar
Zhang, Y., Liu, L., Liang, W. & Zhang, Y. China Fruit Monograph: Chinese Chestnut and Chinese Hazelnut Volume. (China Forestry Press, 2005).
Food and Agriculture Organization of the United Nations. FAOSTAT statistics database. https://www.fao.org/faostat/en/#data/QCL (2018).
Xing, Y. et al. Hybrid de novo genome assembly of Chinese chestnut (Castanea mollissima). GigaScience 8 (2019).
Liu, Y. et al. Beta-amylase and phosphatidic acid involved in recalcitrant seed germination of Chinese chestnut. Frontiers in plant science 13, 828270 (2022).
Article PubMed PubMed Central Google Scholar
Wang, J. et al. Construction of Pseudomolecules for the Chinese Chestnut (Castanea mollissima) Genome. G3 (Bethesda, Md.) 10, 3565–3574 (2020).
Article PubMed CAS Google Scholar
Sun, Y., Lu, Z., Zhu, X. & Ma, H. Genomic basis of homoploid hybrid speciation within chestnut trees. Nature communications 11, 3375 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Staton, M. et al. A reference genome assembly and adaptive trait analysis of Castanea mollissima ‘Vanuxem,’a source of resistance to chestnut blight in restoration breeding. Tree Genetics & Genomes 16, 1–23 (2020).
Article Google Scholar
Hu, G. et al. Pan-genome analysis of three main Chinese chestnut varieties. Frontiers in plant science 13, 916550 (2022).
Article PubMed PubMed Central Google Scholar
Shirasawa, K. et al. Chromosome-level genome assembly of Japanese chestnut (Castanea crenata Sieb. et Zucc.) reveals conserved chromosomal segments in woody rosids. DNA Research 28 (2021).
Staton, M. et al. Substantial genome synteny preservation among woody angiosperm species: comparative genomics of Chinese chestnut (Castanea mollissima) and plant reference genomes. BMC genomics 16, 744 (2015).
Article PubMed PubMed Central Google Scholar
Zheng, Y. et al. Cucurbit Genomics Database (CuGenDB): a central portal for comparative and functional genomics of cucurbit crops. Nucleic acids research 47, D1128–d1136 (2019).
Article PubMed Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature methods 12, 59–60 (2015).
Article PubMed CAS Google Scholar
Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic acids research 47, D351–d360 (2019).
Article PubMed CAS Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research 47, D309–d314 (2019).
Article PubMed CAS Google Scholar
Zheng, Y. et al. iTAK: A Program for Genome-wide Prediction and Classification of Plant Transcription Factors, Transcriptional Regulators, and Protein Kinases. Molecular plant 9, 1667–1670 (2016).
Article PubMed CAS Google Scholar
Brown, J., Pirrung, M. & McCue, L. A. FQC Dashboard: integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool. Bioinformatics (Oxford, England) 33, 3137–3139 (2017).
PubMed CAS Google Scholar
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England) 30, 2114–2120 (2014).
PubMed CAS Google Scholar
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013).
ADS PubMed CAS Google Scholar
Florea, L., Song, L. & Salzberg, S. L. Thousands of exon skipping events differentiate among splicing patterns in sixteen human tissues. F1000Research 2, 188 (2013).
Article PubMed PubMed Central Google Scholar
Freed, D., Aldana, R., Weber, J. A. & Edwards, J. S. The Sentieon Genomics Tools–A fast and accurate solution to variant calling from next-generation sequence data. bioRxiv, 115717 (2017).
Heng, L. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Zheng, Y. Castanea Genome Database (CGD): a comprehensive database of chestnut genome. figshare https://doi.org/10.6084/m9.figshare.27060067 (2024).
NCBI Gene Expression Omnibus (GEO) https://identifiers.org/geo:GSE284510 (2024).
NCBI Gene Expression Omnibus (GEO) https://identifiers.org/geo:GSE284516 (2024).
NCBI Gene Expression Omnibus (GEO) https://identifiers.org/geo:GSE284517 (2024).
NCBI Gene Expression Omnibus (GEO) https://identifiers.org/geo:GSE284518 (2024).
European Variation Archive (EVA) https://identifiers.org/ena.embl:ERP170716 (2025).
Waterhouse, R. M. et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular biology and evolution 35, 543–548 (2018).
Article PubMed CAS Google Scholar

Download references

Acknowledgements

This work was supported by grants from the National Key Research & Development Program of China (2018YFD1000605) and 2020 Connotation Development Quota Project of Beijing University of Agriculture (5066516001/003) to Y.Xing; as well as Beijing University of Agriculture Start-up fund and Enhancing Technology Innovation Capability Project of Beijing University of Agriculture (5076016192/054) to Y.Zheng.

Author information

These authors contributed equally: Jialu Fan, Yu Zhang.

Authors and Affiliations

Beijing Key Laboratory for Agriculture Application and New Technique, College of Plant Science and Technology, Beijing University of Agriculture, Beijing, 102206, China
Jialu Fan, Yu Zhang, Xinghua Nie, Yang Liu, Shangxiao Wei, Haixu Peng, Hanlei Li, Mingjun Zhang, Sen Wang, Ling Qin, Yi Zheng & Yu Xing
Bioinformatics Center, Beijing University of Agriculture, Beijing, 102206, China
Jialu Fan, Shangxiao Wei, Haixu Peng, Hanlei Li, Mingjun Zhang, Sen Wang & Yi Zheng
Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing, 102206, China
Yu Zhang, Xinghua Nie, Yang Liu, Ling Qin & Yu Xing
Library, Beijing University of Agriculture, Beijing, 102206, China
Lu Ning
Ancient Tree Health and Culture Engineering Technology Research Center, National Forestry and Grassland Administration, Beijing, China
Sen Wang & Yi Zheng

Authors

Jialu Fan
View author publications
Search author on:PubMed Google Scholar
Yu Zhang
View author publications
Search author on:PubMed Google Scholar
Xinghua Nie
View author publications
Search author on:PubMed Google Scholar
Yang Liu
View author publications
Search author on:PubMed Google Scholar
Shangxiao Wei
View author publications
Search author on:PubMed Google Scholar
Haixu Peng
View author publications
Search author on:PubMed Google Scholar
Hanlei Li
View author publications
Search author on:PubMed Google Scholar
Mingjun Zhang
View author publications
Search author on:PubMed Google Scholar
Lu Ning
View author publications
Search author on:PubMed Google Scholar
Sen Wang
View author publications
Search author on:PubMed Google Scholar
Ling Qin
View author publications
Search author on:PubMed Google Scholar
Yi Zheng
View author publications
Search author on:PubMed Google Scholar
Yu Xing
View author publications
Search author on:PubMed Google Scholar

Contributions

Y.Xing and Y.Zheng designed and managed the project; Y.Zhang, S.Wei and J.Fan constructed the database; J.Fan, Y.Zhang, X.Nie, Y.Liu, H.Peng, H.Li, M.Zhang, L.Ning and S.Wang collected and analyzed the data; J.Fan, Y.Zheng and Y.Xing wrote and revised the manuscript.

Corresponding authors

Correspondence to Yi Zheng or Yu Xing.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary table 1 (download XLSX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Fan, J., Zhang, Y., Nie, X. et al. Comprehensive curation and validation of genomic datasets for chestnut. Sci Data 12, 860 (2025). https://doi.org/10.1038/s41597-025-05162-x

Download citation

Received: 13 May 2024
Accepted: 09 May 2025
Published: 24 May 2025
Version of record: 24 May 2025
DOI: https://doi.org/10.1038/s41597-025-05162-x