Background & Summary

The family Cistaceae comprises eight genera and 180 species commonly known as rockroses, with numerous taxa of considerable biochemical, biological, and ecological importance [1,2]. Helianthemum Mill. is the largest genus in this family, forming a monophyletic, complex, and species-rich Palearctic plant lineage with approximately 140 species and subspecies, ranging from Macaronesia to Central Asia [3,4]. The rapid evolutionary diversification of Helianthemum is centred in the Mediterranean region and has been driven by major paleoclimatic and geological events, especially since the Upper Miocene [5]. Ecological divergence has played a key role in promoting reproductive isolation and driving phenotypic differences between lineages within this genus [4,6]. Today, Helianthemum species grow in deserts, alpine pastures, and Mediterranean maquis, where they are exposed to harsh environmental conditions, including severe aridity. Unsurprisingly, several lines of evidence point to Helianthemum as an attractive model system for studying incipient speciation and the evolution of species complexes. These complexes are characterized by a notable degree of intraspecific taxonomic diversity, primarily manifested in vegetative differentiation. Among the most significant taxonomic characters are those related to leaf morphology, particularly the presence, shape, type, and abundance of trichomes [7]. Additionally, Helianthemum leaves are a natural source of compounds with biological, aromatic, and pharmacological properties, showing a taxon-dependent variation in the chemical profile [8]. The leaves are rich in phenolics and flavonoids, which regulate the expression of various cytoprotective genes against inflammation and oxidative stress, and contain secondary metabolites with antiparasitic, antibacterial, and antifungal activity [9,10,11,12,13].

Differential gene expression can evolve rapidly and become the basis for ecological divergence [14,15]. Therefore, studying expression divergence may help elucidate phenotypic differences observed even among recently diverged lineages [16,17]. However, the metabolic pathways of leaf compounds in Helianthemum have never been studied using a transcriptomic approach. Despite its ecological and biochemical significance, as well as its suitability as a model for studying niche adaptation and speciation, extraordinarily few genomic resources are available for Helianthemum [18]. No transcript sequence information for leaf tissues in the genus Helianthemum is readily available, and the only existing de novo transcriptome assembly within this genus is limited to root tissues of H. almeriense artificially inoculated with mycorrhizae [19].

In this study, we report the first de novo assembled transcriptome of Helianthemum leaves, specifically for H. marifolium (L.) Mill., a species that is geographically restricted to the south and east of the Iberian Peninsula and southern France. Phylogenomic data support the existence of four recognised subspecies within this species, which exhibit intricate morphological variations with a genetic basis, potentially associated with environmental adaptation [4,20]. The transcriptome assembled in this study is a helpful resource for functional genomics and the elucidation of molecular processes, as well as for studying differentially expressed genes involved in both macro- and micro-evolution and local adaptation. It also offers valuable molecular resources: for instance, it can be mined for EST-SSR markers, which can be applied to assess genetic variation patterns in the numerous threatened and critically endangered species within the genus Helianthemum. This resource may prove invaluable for the development of effective conservation action plans. Finally, this reference transcriptome will facilitate integrated transcriptome and metabolome analyses, assisting researchers in the analysis of functional genes, including those involved in the taxon-dependent biosynthesis of important secondary metabolites.

Methods

Sample collection

Seeds of the four subspecies of Helianthemum marifolium (andalusicum, marifolium, molle and origanifolium) were collected in Spain from eight wild populations (two populations for each subspecies; Table 1). The use of multiple subspecies is suitable to address the physiological variability observed within a species. Indeed, some of these infraspecific taxa possess notable ethnopharmacological value, containing flavonoids and exhibiting a higher polyphenol content than other Helianthemum species, with a probable subspecies-dependent profile [8,21].

Table 1 Geographical location of the H. marifolium seeds sampled for greenhouse cultivation.

The seeds were subjected to mechanical scarification with fine-grained sandpaper and germinated on sterile Petri plates with moist filter paper. Five days later, the seedlings were transferred to pots containing a 3:1:1 mixture of soil:sand:perlite and cultivated in a greenhouse at 22/25 °C (night/day), 40–60% relative humidity, and natural daylight. At 446 days after sowing, two adult plants from each population of origin (i.e., four replicates of every subspecies, 16 plants in total) were selected for RNA isolation. The leaves from the 3–5 apical nodes were frozen immediately in liquid nitrogen and stored at −80 °C until processing at Novogene Company (Cambridge, UK).

RNA extraction, RNA-seq library construction and sequencing

The RNA was extracted using the RNeasy Plant Mini Kit (QIAGEN, Crawley, UK), following the manufacturer’s instructions. Messenger RNA was purified from total RNA using poly-T oligo-attached magnetic beads. Following fragmentation, the first-strand cDNA was synthesised using random hexamer primers followed by the second-strand cDNA synthesis. The library was prepared following the steps of end repair, A-tailing, adapter ligation, size selection, amplification, and purification. The Novogene NGS RNA Library Prep Set was employed for library preparation.

The library was quantified using Qubit and real-time PCR, and its size distribution was checked with the Agilent 5400 Fragment Analyzer system (Agilent, USA). The quantified libraries were subsequently pooled and sequenced on an Illumina NovaSeq 6000 platform, generating paired-end 150 bp reads.

RNA-Seq read quality control and cleaning

An overview of the bioinformatic workflow is shown in Fig. 1. The raw sequence data were subjected to quality control checks using FastQC v0.12.1. Erroneous k-mers were corrected with Rcorrector v1.0.7 [22], and paired-end reads were trimmed using Fastp v1.0.7 [23] to remove adapters, with a minimum Phred quality score of 20. An additional cleaning step was undertaken to remove rRNA using SortMeRNA v4.3.6 [24] with the mr_v4.3 database serving as the reference. Finally, potential contaminants were removed with kraken2 v2.1.3 [25] using the PlusPF and nt databases (see also Technical Validation section).
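To make the quality threshold above concrete, the following minimal sketch shows how a Phred score maps to a base-call error probability (Q = −10·log10 P, so Q20 corresponds to a 1% error rate) and how a FASTQ quality string can be decoded. It assumes the standard Phred+33 encoding; the function names are hypothetical and this is not the code used by Fastp.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def fraction_below(quality_string: str, threshold: int = 20) -> float:
    """Fraction of bases whose quality score is below the threshold."""
    scores = decode_phred33(quality_string)
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(phred_to_error_prob(20))
    print(decode_phred33("II##"))
    print(fraction_below("II##", threshold=20))
```

Trimming tools apply thresholds of this kind per base or per window; the exact rule depends on the tool and its settings, which are not detailed here.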

Fig. 1 Flowchart illustrating the bioinformatic steps and tools used for analysis, from quality trimming and filtering to transcriptome assembly and annotation.

Normalization, de novo transcriptome assembly and annotation

The cleaned read files of the 16 sampled individuals were concatenated into single files for the forward and reverse reads, respectively, and normalized in silico with the Perl script InSilico_read_normalization.pl from Trinity v2.8.6 [26]. The normalized files were subjected to five different assemblies, one with Trinity and four with Oases v0.2.8 [27] using different k-mer sizes (25, 31, 41, and 51). All the assemblies were merged into a single file and processed with the tr2aacds.pl pipeline of EvidentialGene v2023.07.15 [28], which is designed to obtain the optimal, biologically useful “best” set of mRNAs (non-redundant, containing the best-assembled transcripts from each assembler) from several assemblies performed by different methods. For this study, only the “okay set” was selected for further analysis.
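When several assemblies are merged into one file, contig identifiers from different assemblers can collide. The sketch below illustrates one way to merge FASTA records while prefixing each identifier with an assembly label. It is an assumption about how such a merge could be done, not the authors' script; the labels and function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the identifier.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_assemblies(
    assemblies: Dict[str, Iterable[str]]
) -> Iterator[Tuple[str, str]]:
    """Merge assemblies, prefixing each identifier with its assembly label."""
    for label, lines in assemblies.items():
        for identifier, sequence in read_fasta(lines):
            yield f"{label}_{identifier}", sequence


if __name__ == "__main__":
    # Hypothetical in-memory assemblies; real input would be FASTA files.
    demo = {
        "trinity": [">c1 len=4", "ACGT"],
        "oases_k25": [">c1", "GG", "CC"],
    }
    for identifier, sequence in merge_assemblies(demo):
        print(f">{identifier}\n{sequence}")
```

Prefixing keeps the assembler of origin traceable in the merged set, which is convenient when inspecting which assembler contributed each retained transcript.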

The annotation was carried out using TransDecoder v5.7.1 (https://github.com/TransDecoder/TransDecoder) [29] to predict candidate coding regions. BLASTx v2.5.0 was used for the similarity search based on the homology of transcripts and predicted proteins with the latest version of the UniProtKB/Swiss-Prot database. The search was conducted with a maximum E-value threshold of 1e-5, and 59,524 transcripts returned a positive BLAST hit against the database. The Gene Ontology (GO) annotations were performed using Trinotate v4.0.2 [30], resulting in 44,188 transcripts assigned to GO terms in the three primary categories: cellular component, molecular function, and biological process. The most abundant GO terms annotated are shown in Fig. 2.
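As an illustration of the E-value filter described above, the sketch below parses BLAST tabular output and keeps, for each query transcript, the best-scoring hit at or below the threshold. It assumes the standard 12-column tabular format (`-outfmt 6`); the output format actually used is not stated in the text, and the function name is hypothetical.

```python
import csv
from typing import Dict, Iterable

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6 = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, dict]:
    """Return the highest-bitscore hit per query with E-value <= max_evalue."""
    best: Dict[str, dict] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(OUTFMT6, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[hit["qseqid"]] = {
                "sseqid": hit["sseqid"],
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


if __name__ == "__main__":
    # Hypothetical hits: t1 has two passing hits, t2 has none.
    demo = [
        "t1\tsp|P1|A\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
        "t1\tsp|P2|B\t80.0\t100\t10\t0\t1\t300\t1\t100\t1e-20\t120",
        "t2\tsp|P3|C\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30",
    ]
    hits = best_hits(demo)
    print(len(hits), "transcripts with a hit:", hits)
```

Counting the keys of the returned dictionary gives the number of transcripts with at least one hit under the threshold, which is the kind of figure reported above.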

Fig. 2 Bar chart of the ten most frequent Gene Ontology (GO) terms annotated for the assembled transcripts in each GO category: (a) Cellular Component, (b) Molecular Function, and (c) Biological Process. GO terms are shown on the y-axis and the number of transcripts on the x-axis.

Mapping, count table, and expression analyses

The software Salmon v1.10.0 [31] was used to map reads against the reference transcriptome and generate equivalence classes for reads from each sample. Corset v1.09 [32] was employed to obtain gene-level counts rather than transcript-level counts, resulting in a count matrix. This was then imported into R for expression analyses, while a file containing the clusters was used for gene annotation.
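The idea of moving from transcript-level to gene-level counts can be sketched as a simple aggregation: counts of all transcripts assigned to the same cluster are summed per sample. This is a deliberately simplified illustration and not Corset's algorithm, which builds clusters from shared reads and expression patterns; the cluster names and function name below are hypothetical.

```python
from typing import Dict, List


def gene_level_counts(
    transcript_counts: Dict[str, List[float]],
    clusters: Dict[str, str],
) -> Dict[str, List[float]]:
    """Sum per-sample transcript counts within each cluster (gene)."""
    genes: Dict[str, List[float]] = {}
    for transcript, counts in transcript_counts.items():
        gene = clusters.get(transcript)
        if gene is None:
            # Transcripts without a cluster assignment are skipped.
            continue
        if gene not in genes:
            genes[gene] = [0.0] * len(counts)
        genes[gene] = [a + b for a, b in zip(genes[gene], counts)]
    return genes


if __name__ == "__main__":
    # Hypothetical counts for three transcripts in two samples.
    counts = {"t1": [10, 0], "t2": [5, 5], "t3": [1, 2]}
    mapping = {"t1": "Cluster-0.0", "t2": "Cluster-0.0", "t3": "Cluster-1.0"}
    print(gene_level_counts(counts, mapping))
```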

The count data were input into edgeR v4.4.0 [33] for filtering and normalisation using the TMM method. Exploratory analyses were then performed using FactoMineR v2.11 [34], including a principal component analysis (PCA) and a cluster analysis on the normalised counts. Both the PCA of gene expression and the hierarchical clustering of samples revealed a well-defined separation of the four subspecies and a similar gene expression pattern within each subspecies (Fig. 3). The transcriptome presented here is a valuable resource for comparative genomics in the genus Helianthemum, expands the molecular data available for evolutionary studies in the family Cistaceae, and can be used to uncover single nucleotide polymorphisms (SNPs) in the coding regions of the genome. The transcript annotation will help to identify the genes underlying particular traits of interest and to profile their expression, providing transcriptomic resources for future studies on cell signalling pathways.
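As a rough illustration of the filtering step, the sketch below computes counts per million (CPM) and keeps genes that reach a minimum CPM in a minimum number of samples. It does not implement TMM normalisation or edgeR's own filtering rule, and the thresholds are assumptions chosen for the example, since the text does not state the ones used.

```python
from typing import Dict, List


def counts_per_million(matrix: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each sample (column) so that its counts sum to one million."""
    if not matrix:
        return {}
    n_samples = len(next(iter(matrix.values())))
    lib_sizes = [sum(row[j] for row in matrix.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib if lib else 0.0 for c, lib in zip(row, lib_sizes)]
        for gene, row in matrix.items()
    }


def filter_low_expression(
    matrix: Dict[str, List[float]],
    min_cpm: float = 1.0,   # assumed threshold, not taken from the text
    min_samples: int = 2,   # assumed threshold, not taken from the text
) -> Dict[str, List[float]]:
    """Keep genes with CPM >= min_cpm in at least min_samples samples."""
    cpm = counts_per_million(matrix)
    return {
        gene: row
        for gene, row in matrix.items()
        if sum(value >= min_cpm for value in cpm[gene]) >= min_samples
    }


if __name__ == "__main__":
    # Hypothetical gene-by-sample count matrix with two samples.
    demo = {"g1": [500000, 250000], "g2": [499999, 749999], "g3": [1, 1]}
    print(sorted(filter_low_expression(demo, min_cpm=2.0, min_samples=2)))
```

Filtering on CPM rather than raw counts makes the threshold comparable across samples with different sequencing depths.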

Fig. 3 Similarity analysis of RNA-seq samples. (a) Principal component analysis of gene expression, with samples from different subspecies highlighted in different colours. (b) Hierarchical clustering of samples, with each subspecies’ samples highlighted in different colours. The numbers in the labels of each sample represent the population code (see Table 1) followed by the individual identifier.

Data Records

The raw reads of the 16 samples used for the transcriptome assembly are deposited in the NCBI Sequence Read Archive (SRA) database with accession number SRP522727 [35]. The assembled and curated transcriptome, the predicted peptide sequences from TransDecoder and the gene ontology annotations from Trinotate can all be found in the Zenodo public repository [36]. The raw gene-level count matrix and the assembled transcriptome are also available at the Gene Expression Omnibus (GEO) database under accession GSE291935 [37].

Technical Validation

Assembly quality control

Read quality was assessed with FastQC. Assembly completeness and contiguity were evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.7.0 [38] and Quast v5.0.2 [39]. The final assembled transcriptome comprised 122,002 contigs (the largest being 15,683 bp) with an N50 of 1,533 bp (Table 2). Of the 2,805 BUSCO genes in the eudicotyledons_odb12 database, 88.4% were recovered as complete (single copy: 69.4%; duplicated: 19.0%), 5.8% were fragmented, and 5.8% were missing.
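For readers unfamiliar with the N50 statistic reported above, the sketch below shows the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and toy lengths are hypothetical; tools such as Quast compute this directly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length reaches half.
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy example: total length 20, so half is 10; 6 + 5 = 11 reaches it.
    print(n50([2, 3, 4, 5, 6]))
```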

Table 2 De novo assembly statistics of the H. marifolium transcriptome.

Removal of contaminant reads prior to transcriptome assembly is an important step to ensure the quality of the delivered transcriptome. In our study, of the 487,286,173 forward sequences processed with kraken2 [25], 14,978,328 were classified and eliminated with the first (PlusPF) database, and 1,005,362 with the second (nt) database, retaining a total of 471,302,483 sequences for normalization. As for the 487,086,862 reverse sequences, 14,997,325 were classified and eliminated with the first database, and 998,892 with the second, leaving a total of 471,090,645 sequences for normalization.
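The read accounting above can be checked with a short calculation: the retained total is the initial count minus the reads removed at each filtering step. The sketch below uses the figures reported in the text; the function name is hypothetical.

```python
from typing import List


def retained_after_filtering(total: int, removed_per_step: List[int]) -> int:
    """Reads remaining after subtracting those removed at each step."""
    return total - sum(removed_per_step)


def percent_removed(total: int, removed_per_step: List[int]) -> float:
    """Percentage of the initial reads removed across all steps."""
    return 100.0 * sum(removed_per_step) / total


# Figures reported in the text (PlusPF step first, then nt).
FORWARD_TOTAL, FORWARD_REMOVED = 487_286_173, [14_978_328, 1_005_362]
REVERSE_TOTAL, REVERSE_REMOVED = 487_086_862, [14_997_325, 998_892]

if __name__ == "__main__":
    for name, total, removed in [
        ("forward", FORWARD_TOTAL, FORWARD_REMOVED),
        ("reverse", REVERSE_TOTAL, REVERSE_REMOVED),
    ]:
        kept = retained_after_filtering(total, removed)
        print(f"{name}: {kept:,} retained, {percent_removed(total, removed):.2f}% removed")
```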