Background & Summary

Nicotiana attenuata and Nicotiana obtusifolia are diploid (2n = 2x = 24) Solanaceae species that have been developed as ecological model plants to study plant-environment interactions in nature1,2,3. In particular, N. attenuata has been extensively used to understand the function, evolution and diversification of plant specialized metabolites (PSMs), such as nicotine, as a defensive response to herbivory and other ecological factors, using genomics, transcriptomics, and metabolomics3,4,5. Further, N. attenuata has been used to understand the balance between plant defence and protection against the autotoxicity conferred by toxic PSMs6. N. obtusifolia is likewise used, along with N. attenuata and other species, to study ecological interactions of plants with pathogens, herbivores, and pollinators in natural habitats3. Previous omics studies have used N. obtusifolia to understand the genetic basis of metabolic innovation and diversification of PSMs in the Nicotiana genus, which are produced as a defence response to environmental stress7,8,9.

Given the significance of these species as ecological models, and the need for genomics-guided multi-omics strategies to further understand metabolic innovations at the level of both plant organ and developmental stage, high-quality genomes are essential for comparative analysis10. A large number of genomic, transcriptomic, and metabolomic datasets already exist for N. attenuata11. However, high-quality genome assemblies have not been available for either of these Nicotiana species. The first published draft genomes of N. attenuata and N. obtusifolia had low assembly contiguity (N50 values of 524.5 Kb and 134.1 Kb, respectively)9. Currently, two genome assemblies of N. attenuata are available, superscaffolded into 12 linkage groups using optical mapping and genetic mapping data (NCBI GenBank accessions: GCF_001879085.1 and GCA_030864195.1); however, several issues remain, including fragmented chromosomes and a high percentage of gaps (Supplementary Table 1).

Here, we used the existing N. attenuata genome assembly (GCA_030864195.1, assembled from PacBio reads)12 to reconstruct the chromosomes based on chromatin interaction data obtained from Hi-C sequencing, and report the improved chromosome-level genome assembly of N. attenuata (12 chromosomes, 2.07 Gb) (Fig. 1a, Table 1)13. Compared to the previously available assemblies, our Hi-C data-based assembly showed a greater genomic length covered by the chromosomes, fewer ambiguous bases (N), a lower total number and length of gaps, clean Hi-C contact maps with no evidence of mis-joins, and higher assembly accuracy (Supplementary Table 1). Further, we performed a de novo genome assembly of N. obtusifolia using PacBio sequencing data and anchored it to chromosomes using Hi-C data, reporting the first chromosome-scale N. obtusifolia genome (12 chromosomes, 1.29 Gb) (Fig. 1b, Table 1). The improvements in the genome assembly statistics compared to the previous contig-level draft assembly of N. obtusifolia are summarised in Supplementary Table 2.

Fig. 1

Circos plots showing the genomic features (window size 100 Kb)13 – (a) N. attenuata, (b) N. obtusifolia. From the outer to inner circle – I. Chromosome length, II. Percentage of repeat regions, III. Percentage of gene regions, IV. GC content.

Table 1 Genome assembly and annotation statistics of the Nicotiana genomes.

Repetitive regions accounted for 83.29% and 79.21% of the N. attenuata and N. obtusifolia genome assemblies, respectively. Notably, the use of long PacBio reads together with reads from high-throughput chromatin conformation capture (Hi-C) technology improved the assembly quality of these highly repetitive genomes, which had been a limitation in the previous studies9,12. After soft-masking the genome assemblies, we identified 35,166 and 27,352 protein-coding genes in N. attenuata and N. obtusifolia, respectively, using the BRAKER3 pipeline14 (Table 1).

The chromosome-level genome assemblies of N. attenuata and N. obtusifolia presented in this study will serve as a valuable dataset, facilitating a better understanding of the genomic basis of metabolic innovations not only in Nicotiana but also in the Solanaceae family.

Methods

Sample preparation and sequencing

N. attenuata (NCBI TaxID: 49451) plants were grown from Max Planck Institute for Chemical Ecology (MPI-CE) seed stock accession UT31, an inbred ‘UT’ line (31st inbred generation) derived from seeds originally collected at DI Ranch (Santa Clara, Utah, USA) in 1988. UT31 was the same N. attenuata accession that was used in the previous study12. N. obtusifolia (NCBI TaxID: 200316) plants were grown from MPI-CE seed stock Nob02, an inbred line derived from seeds originally collected at Lytle ranch preserve (Santa Clara, Utah, USA) in 2004. The N. attenuata and N. obtusifolia plants used for sequencing were made available as herbarium vouchers at the Johannes Gutenberg University Botanical Garden Herbarium (Herbarium MJG) with accessions MJG 048398 and MJG 048399, respectively (Supplementary Figures 1, 2).

For sequencing, high-quality genomic DNA was extracted from frozen leaf samples using a modified CTAB method15. RNase A was used to remove RNA contaminants. For N. attenuata, a Hi-C fragment library (insert size 300–700 bp) was constructed using the Mate-pair kit after chromatin cross-linking with formaldehyde and enzymatic digestion with DpnII. The library was then sequenced on the Illumina NovaSeq platform. For N. obtusifolia, whole-genome sequencing was performed on the PacBio Revio platform. For Hi-C data-based contig anchoring, a Hi-C fragment library with an insert size of 300–700 bp was constructed using the same method as for N. attenuata and sequenced on the Illumina NovaSeq platform. A PacBio Iso-seq library was also constructed for N. obtusifolia from total RNA of the leaf sample and sequenced on a PacBio Sequel II platform. The sequencing was carried out at Biomarker Technologies (Beijing, China).

Genome assembly

For N. attenuata, 301.92 Gb of clean data were generated from the Hi-C library and mapped to the previously assembled N. attenuata genome (GCA_030864195.1, assembled from PacBio reads)12 using BWA v0.7.17 with the default parameters16. Invalid read pairs, including dangling-end, self-cycle, re-ligation, and dumped products, were filtered out. LACHESIS was used to cluster, order, and orient the contigs utilising the Hi-C interaction signals17. After Hi-C data-based anchoring, 2.07 Gb (94.6%) of the total assembly (2.19 Gb) was in confirmed order and orientation, constituting the 12 chromosomes. The parameters used in LACHESIS were: CLUSTER_MIN_RE_SITES = 161; CLUSTER_MAX_LINK_DENSITY = 2; ORDER_MIN_N_RES_IN_TRUNK = 199; ORDER_MIN_N_RES_IN_SHREDS = 239.
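The contiguity and anchoring figures quoted in this section can be reproduced from a list of sequence lengths. The following is a minimal Python sketch, not the procedure used in this study: the contig names and lengths are toy values, and the function names `n50` and `anchored_fraction` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Iterable, Set


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def anchored_fraction(lengths: Dict[str, int], anchored: Set[str]) -> float:
    """Fraction of total assembly length carried by anchored sequences."""
    total = sum(lengths.values())
    placed = sum(n for name, n in lengths.items() if name in anchored)
    return placed / total if total else 0.0


# Toy example (hypothetical contigs, arbitrary length units).
toy_lengths = {"ctg1": 50, "ctg2": 30, "ctg3": 15, "ctg4": 5}
toy_anchored = {"ctg1", "ctg2", "ctg3"}

print(n50(toy_lengths.values()))                     # 50
print(anchored_fraction(toy_lengths, toy_anchored))  # 0.95
```

Applied to real data, the lengths would typically be read from a FASTA index of the assembly, and the anchored set from the list of contigs placed on chromosomes.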

Prior to the de novo genome assembly of N. obtusifolia, Illumina paired-end reads from the previous study were quality-filtered using Trimmomatic v0.39 with the parameters “SLIDINGWINDOW:4:15 MINLEN:36”. The filtered reads were then used to construct a k-mer frequency histogram (k = 31) using Jellyfish v2.2.10 and to estimate genomic characteristics using GenomeScope v2 (Supplementary Figure 3)18,19,20. For the genome assembly, high-accuracy CCS data (96.91 Gb, N50 = 20.67 Kb) were assembled using Hifiasm v0.24.0 with the parameters: l = 0 and n = 4, resulting in 329 contigs (Supplementary Table 3)21. For anchoring of the contigs, 197.74 Gb of clean Hi-C data were mapped to the assembly using BWA v0.7.17 with default parameters16. As for the N. attenuata genome, invalid read pairs were filtered out, and the valid interaction read pairs were used to cluster, order, and orient the contigs onto chromosomes with LACHESIS17. After Hi-C data-based anchoring, 1.29 Gb (98.99%) of the total assembly (1.3 Gb) was in confirmed order and orientation, constituting the 12 chromosomes. The parameters used in LACHESIS were: CLUSTER_MIN_RE_SITES = 723; CLUSTER_MAX_LINK_DENSITY = 2; ORDER_MIN_N_RES_IN_TRUNK = 15; ORDER_MIN_N_RES_IN_SHREDS = 15.
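To illustrate the idea behind the k-mer frequency analysis, the Python sketch below counts canonical k-mers, builds the multiplicity histogram, and derives a rough genome size as total k-mer count divided by the depth of the main peak. This is a deliberate simplification: GenomeScope fits a statistical model to the histogram rather than reading off a single peak, and Jellyfish is a far more efficient counter. Canonical counting and the `min_depth` cut-off for error k-mers are assumptions made here, not settings reported in this study.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count canonical k-mers, skipping any k-mer with a non-ACGT base."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts: Counter) -> Counter:
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 2) -> float:
    """Rough estimate: (total k-mers at depth >= min_depth) / (peak depth).

    k-mers below min_depth are treated as sequencing errors (an assumption).
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        return 0.0
    peak_depth = max(usable, key=lambda depth: usable[depth])
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth
```

For example, a histogram with 100 distinct k-mers at depth 10 and 50 error k-mers at depth 1 gives an estimate of 100 positions, because the depth-1 k-mers are discarded and the remaining 1,000 k-mer occurrences are divided by the peak depth of 10.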

The unplaced contigs from both the N. attenuata and N. obtusifolia genomes were further analysed using BlobTools to check for contamination22. For this analysis, the Illumina paired-end data from the previous study9 were first quality-filtered using Trimmomatic v0.39 with the parameters “SLIDINGWINDOW:4:15 MINLEN:36”18. The filtered paired-end reads were mapped onto the respective Nicotiana contigs using BWA-MEM16, and the contigs were searched against the NCBI-nt database using BLASTN; the read mappings and BLASTN hits were then used to construct the blobplots. Only the contigs that were assigned as “Streptophyta”, “undef”, or “no-hit” were retained (Supplementary Figures 4, 5).
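The retention rule described above amounts to a simple label filter. The Python sketch below assumes the taxonomic assignments have already been extracted into a mapping of contig name to label; it does not parse the actual BlobTools output format, and the contig names and the "Proteobacteria" label are hypothetical examples.

```python
from typing import Dict, List, Set, Tuple

# Labels retained in this study, as stated in the text.
RETAINED_LABELS: Set[str] = {"Streptophyta", "undef", "no-hit"}


def split_by_taxon(
    assignments: Dict[str, str],
    retained: Set[str] = RETAINED_LABELS,
) -> Tuple[List[str], List[str]]:
    """Split contigs into (kept, flagged) according to their taxon label."""
    kept = [contig for contig, taxon in assignments.items() if taxon in retained]
    flagged = [contig for contig, taxon in assignments.items() if taxon not in retained]
    return kept, flagged


# Hypothetical assignments for four unplaced contigs.
toy_assignments = {
    "ctgA": "Streptophyta",
    "ctgB": "Proteobacteria",
    "ctgC": "no-hit",
    "ctgD": "undef",
}

kept, flagged = split_by_taxon(toy_assignments)
print(kept)     # ['ctgA', 'ctgC', 'ctgD']
print(flagged)  # ['ctgB']
```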

Genome annotation

Both N. attenuata and N. obtusifolia genome assemblies were annotated using a similar method. Prior to coding gene prediction, the genome assemblies were used to construct the corresponding de novo repeat libraries using RepeatModeler v2.0.2, with “-LTRStruct” and other default parameters23. The resultant repeat libraries were used to soft-mask the respective genome assemblies using RepeatMasker v4.1.2 (https://www.repeatmasker.org).

The soft-masked Nicotiana genome assemblies were used for the prediction of high-confidence protein-coding genes using BRAKER3 (ref. 14), which integrates transcriptome and proteome evidence for gene prediction through the GeneMark-ETP pipeline24. For transcriptome-based evidence of N. attenuata, the RNA-Seq data (Illumina short reads) from the previous study9 were quality-filtered using Trimmomatic v0.39 with the parameters “SLIDINGWINDOW:4:15 MINLEN:36”, and the filtered data were used in BRAKER3 (refs. 14, 18). Within the BRAKER3 pipeline, AUGUSTUS25 was also used for the prediction of coding genes, and TSEBRA26 was used to combine the GeneMark-ETP and AUGUSTUS-based gene predictions into a single output.

For transcriptome-based evidence of N. obtusifolia, both Illumina short-read RNA-Seq data from a previous study3 and the newly generated PacBio Iso-seq CCS data (292,933 reads) were used in two separate BRAKER3 analyses. First, the Illumina short reads were quality-filtered using Trimmomatic v0.39 with the parameters “SLIDINGWINDOW:4:15 MINLEN:36”, and the filtered data were used in BRAKER3 (refs. 14, 18). Next, the Iso-seq data were mapped onto the genome using Minimap2 (ref. 27), and the resulting alignments were used in a separate BRAKER3 analysis. The results from the two BRAKER3 analyses were merged using TSEBRA26.

In all the BRAKER3 analyses for both Nicotiana species, the protein sequences from the previous draft assembly of N. attenuata9, along with those of species belonging to Nicotiana and other Solanaceae genera (only chromosome-level assemblies were considered), were used as a set of extrinsic proteome evidence in GeneMark-ETP. These Solanaceae species were: N. sylvestris28, N. tomentosiformis28, N. tabacum28, N. benthamiana29, Capsicum annuum30, Solanum lycopersicum31, Physalis pruinosa32, Datura inoxia33, Lycianthes biflora33, Lycium chinense34, and Iochroma cyaneum35.

The Nicotiana coding gene sets were filtered to extract the longest isoform for each gene model and to remove coding genes with a length of <100 bp using AGAT (https://github.com/NBISweden/AGAT). Further, coding genes with a repeat content of >50% were filtered out, using a method similar to that of a previous study36. The protein-coding genes were annotated using the eggNOG-mapper genome annotation server, which uses orthology relationships to assign KO (KEGG Orthology) terms, GO (Gene Ontology) terms, COG (Clusters of Orthologous Genes) categories, CAZy families, and Pfam domains37. Additionally, the coding genes were mapped against the UniRef90 database using Diamond v2.0.13 with the parameters: “-k 1 -e 0.00001 -f 6 --sensitive”38,39.
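The three gene-set filters can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AGAT-based procedure itself: it assumes that length is measured on the coding sequence, that repeat content is the fraction of soft-masked (lowercase) bases in that sequence, and that isoform selection precedes the other two filters. The `Transcript` class and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Transcript:
    gene_id: str
    transcript_id: str
    cds: str  # soft-masked coding sequence; lowercase bases mark repeats


def repeat_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def filter_gene_set(
    transcripts: Iterable[Transcript],
    min_len: int = 100,
    max_repeat: float = 0.5,
) -> List[Transcript]:
    """Keep the longest isoform per gene, then drop genes that are shorter
    than min_len or whose repeat fraction exceeds max_repeat."""
    longest: Dict[str, Transcript] = {}
    for t in transcripts:
        best = longest.get(t.gene_id)
        if best is None or len(t.cds) > len(best.cds):
            longest[t.gene_id] = t
    return [
        t for t in longest.values()
        if len(t.cds) >= min_len and repeat_fraction(t.cds) <= max_repeat
    ]


# Hypothetical gene models.
toy = [
    Transcript("g1", "g1.t1", "A" * 120),
    Transcript("g1", "g1.t2", "A" * 150),            # longest isoform of g1
    Transcript("g2", "g2.t1", "A" * 90),             # shorter than 100 bp
    Transcript("g3", "g3.t1", "a" * 80 + "A" * 40),  # about 67% repeat
]
print([t.transcript_id for t in filter_gene_set(toy)])  # ['g1.t2']
```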

Evaluation of genome assembly and annotation

Merqury v1.3 was used to assess assembly completeness and accuracy by comparing k-mers from the Nicotiana genome assemblies and the quality-filtered Illumina paired-end reads, with a k-mer size of 21, as predicted by the “best_k.sh” script9,40. Further, the LTR Assembly Index (LAI) score41 was estimated for the chromosome-level genomes of N. attenuata and N. obtusifolia using GenomeTools42 v1.6.1 and LTR_retriever43 v3.0.1.
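For readers unfamiliar with how k-mer comparisons translate into quality values, the Python sketch below follows the consensus QV and completeness definitions given in the Merqury publication: the per-base error rate is derived from the share of assembly k-mers that are absent from the reads, and completeness is the share of reliable read k-mers recovered in the assembly. The function and argument names are hypothetical, and the k-mer counts would in practice come from Merqury's own k-mer databases.

```python
import math


def consensus_qv(asm_only_kmers: int, asm_total_kmers: int, k: int) -> float:
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers:  k-mers found in the assembly but not in the reads
    asm_total_kmers: all k-mers found in the assembly
    """
    if asm_only_kmers == 0:
        return math.inf
    p_base_correct = (1 - asm_only_kmers / asm_total_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


def kmer_completeness(shared_solid_kmers: int, read_solid_kmers: int) -> float:
    """Percentage of reliable read k-mers that are present in the assembly."""
    return 100 * shared_solid_kmers / read_solid_kmers


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV back to a per-base error rate."""
    return 10 ** (-qv / 10)
```

Under this scale, a QV of 40 corresponds to one expected error per 10 kb and a QV of 50 to one per 100 kb.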

To evaluate the completeness of the whole-genome assembly and the predicted coding gene sets, BUSCO v5.4.3 was used in “genome” and “proteins” modes, respectively, with the solanales_odb10 database44. The quality of Nicotiana gene model annotations was assessed using GAQET2 to provide metrics produced by AGAT, OMArk, and PSAURON45,46,47.

Data Records

The newly generated sequencing data for N. attenuata have been deposited in the NCBI database with BioProject accession number PRJNA1245670. The Hi-C genomic reads of N. attenuata are available under SRA accession SRR32964043 (ref. 48). Genome and transcriptome sequencing raw data of N. obtusifolia generated in this study have been deposited in the NCBI database with BioProject accession number PRJNA1332718. The SRA accessions for the N. obtusifolia sequencing reads are: SRR35556615 (PacBio Revio genomic reads)49, SRR35556614 (Hi-C genomic reads)50, and SRR35556613 (PacBio Iso-seq transcriptome reads)51. The Whole Genome Shotgun projects have been deposited at DDBJ/ENA/GenBank under the accessions JBUBWS000000000 (ref. 52) and JBUCPO000000000 (ref. 53) for N. attenuata and N. obtusifolia, respectively. The versions described in this paper are versions JBUBWS010000000 and JBUCPO010000000 for N. attenuata and N. obtusifolia, respectively. The genome assembly and annotation files are also available at Figshare54,55. The mapping results of the Nicotiana gene sets constructed in this study against those of the previous study9 are also available at Figshare54,55.

Technical Validation

To evaluate the quality and completeness of the genome assemblies, we used multiple strategies. First, the Hi-C contact maps showed strong intra-chromosomal interaction signals along the diagonal, supporting the structural integrity of the assemblies (Fig. 2). Second, Merqury analysis showed k-mer completeness and QV (consensus quality) values of 96.53% and 42.13, respectively, for N. attenuata, and 98.35% and 48.22, respectively, for N. obtusifolia (Supplementary Figures 6, 7). Third, the chromosome-level genome assemblies of N. attenuata and N. obtusifolia had LAI scores of 14.6 and 15.6, respectively, which correspond to “Reference”-standard genome assemblies according to Ou et al.41 (Fig. 3). Fourth, BUSCO analysis showed the presence of 98.5% complete BUSCO genes in both the N. attenuata and N. obtusifolia genome assemblies (Table 2). At the proteome level, BUSCO completeness was 99.3% and 97.9% for N. attenuata and N. obtusifolia, respectively (Table 2). Fifth, the coding gene sets showed PSAURON quality scores of 96.1 and 96.7 for N. attenuata and N. obtusifolia, respectively. Detailed quality metrics for the genome annotation are provided in Supplementary Table 4.
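The LAI-based classification can be expressed as a small lookup. The Python sketch below uses the thresholds proposed by Ou et al. as we understand them (Draft below 10, Reference from 10 to below 20, Gold at 20 and above); the function name `lai_class` is a hypothetical helper, and the thresholds should be checked against the original publication before reuse.

```python
def lai_class(lai: float) -> str:
    """Classify an assembly by its LTR Assembly Index (LAI) score."""
    if lai < 0:
        raise ValueError("LAI must be non-negative")
    if lai < 10:
        return "Draft"
    if lai < 20:
        return "Reference"
    return "Gold"


# LAI scores reported in this study.
for species, score in [("N. attenuata", 14.6), ("N. obtusifolia", 15.6)]:
    print(species, lai_class(score))
```

Both reported scores fall in the 10 to 20 interval, consistent with the “Reference” label given above.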

Fig. 2

Hi-C contact maps of the genome assemblies – (a) N. attenuata, (b) N. obtusifolia.

Fig. 3

Distribution of the LAI scores in the 12 chromosomes of – (a) N. attenuata, (b) N. obtusifolia.

Table 2 BUSCO completeness of the Nicotiana genomes.