Background & Summary

Salix L. (commonly known as willows) is the largest genus of woody plants in the Northern Hemisphere within the family Salicaceae s.str., comprising ca. 400–520 species1. Willows have significant economic value due to their use in ornamental use, landscaping, soil engineering, wind prevention, etc1,2,3. Shrub willows have been identified as a promising biomass crop and are widely used for biomass production due to their ease of propagation and ability to grow quickly in short rotation coppice cycles with minimal fertilizer inputs. Therefore, they are considered the most suitable woody bioenergy crops and are widely planted4. To fully utilize the potential of renewable energy, it is important to maintain willows free of pests and diseases while improving yields without significantly increasing the need for fertilizers and water5. Willows are dioecious, though some species are monoecious, making them an excellent taxon for studying plant sex determination evolution in recent years. Willows exhibit both ZW and XY sex systems, and the location of their sex-determining regions is dynamic and varies among species6,7,8,9,10. Furthermore, willows are capable of vegetative reproduction from stumps, branches or roots, which allows for the simple generation of clones with the same genetic material2. These features make them convenient for control experiments aimed at studying gene functions.

Cushion willow (Salix brachista C. K. Schneider) is a common component of alpine subnival assemblages of the genus Salix L. It is a cushion plant with a creeping stem and a large number of lateral branches, growing to a height of usually no more than 5 cm10. Although mostly dioecious, we observed hermaphrodite flowers in one population, indicating that its sex determination region may be undergoing dynamic evolution. This makes it a very suitable system for studying willow sex determination evolution. It is mainly distributed in subnival zones with an elevation of around 4000 m (and occasionally found in lower elevations around 3000 m) in the Hengduan Mountains and adjacent areas, i.e. the eastern Himalaya and middle Yunnan Plateau. These high-altitude areas, such as the alpine subnival zone, are characterized by harsh environments, including strong solar radiation, strong winds, low temperatures, dramatic daily temperature fluctuations, hypoxia, poor soil, and uneven humidity and precipitation11. Plants in the alpine subnival zone, such as the Cushion willow, must cope with the harsh alpine environment. As a result, they have accumulated numerous stress resistance genes in their genome during their adaptive evolution to the alpine environment12,13,14.

In summary, Cushion willow is a suitable system for studying the evolution of plant sex determination, adaptive evolution of alpine plants, mining stress resistance genes in alpine plants, and studying related gene functions. Therefore, this species could have potential value in genetic improvements for willows used as bioenergy crops, in gardening, and as ornamental plants. However, the previously reported genome of Cushion willow still contains several repetitive sequences, such as telomeres and 5S rDNA, which remain un-assembled. Furthermore, there is a limited availability of a gene expression atlas. These factors hinder its potential value as mentioned above.

In this study, we present a high-quality genome and a transcriptome map of the Cushion willow, resolved by haplotype. (a) The haplotype-resolved, chromosome level genome was assembled using PacBio Revio System in circular consensus sequence, Illumina high-throughput chromosome conformation capture sequencing (Hi-C), Illumina high-throughput RNA-seq, and Nanopore full-length transcript technologies. We assembled 38 chromosomes that were classified as haplotype A and haplotype B. This result is consistent with previous karyotyping results (2n = 38)15. Haplotype A has a genome size of 401.5 Mb and contig N50 length of 22.6 Mb, while haplotype B has a genome size of 386.2 Mb and contig N50 length of 21.8 Mb (Table 1). The high-resolution genome annotated to 57,169 genes that contained 53,238 protein-coding genes and 3,931 RNAs (Supplement Table 5) (b) Transcripts were obtained from seven different organs, and full-length transcripts were obtained for mixed samples of these tissues. A total of 28,587 non-redundant transcripts were obtained from seven organs using Illumina sequencing. Additionally, 164.5 million full-length transcripts were obtained through Nanopore sequencing, with a mean read length of 981 bp and an N50 length of 1,194 bp. The Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases were used to annotate these transcripts and determine their function. Additionally, differentially expressed genes (DEGs) were identified in pairs of organs, and 33,414 alternative splicing (AS) events and 36,634 alternative polyadenylation (APA) sites were detected using full-length transcripts, which are more accurate than Illumina data (Figs. 5 and 6). The genome and transcriptome map of Cushion willow, resolved by haplotype, will provide valuable reference material for studying genetic improvement of Cushion willow plants and other alpine Salix species, as well as exploring the adaptive evolution of alpine extreme environments on the Qinghai-Tibet Plateau.

Table 1 Summary of the S. brachista genome assembly data.

Methods

Sample collection

For the extraction of genomic DNA, fresh young leaves of S. brachista were collected from Tianbao Mountain, Shangri-La Country, Yunnan Province, China. Additionally, we collected RNA from seven organs (roots, stems, young leaves, mature leaves, monoecious flowers, female flowers and male flowers) of Cushion willow plants. For each organ, three biological replicates were collected from different plants. The newly obtained materials were promptly frozen in liquid nitrogen. The related sequences were obtained from Kaitleai Mingjing Gene Technology (Beijing) Co., Ltd.

Genome and transcriptome sequencing

For the genome, we extracted total DNA using the CTAB method for sequencing16. Before long-read sequencing, the DNA was purified by DNeasy Plant Mini Kit (Qiagen, Germantown, MD, USA). The purity and integrity of the DNA were subsequently assessed by 1% agarose gel electrophoresis. Qubit 2.0 fluorometer (Life Technologies, Carlsbad, CA, USA) was using to assess the concentration of DNA. Following the positive assessment result, we constructed a PacBio long-read library and generated 38.3 Gb (~2.11 million reads) of HiFi raw data on the PacBio Sequel II platform (Supplementary Table 1). We prepared a Hi-C library following standard protocols17. The library was subsequently sequenced on the Illumina NovaSeq 6000 platform. Approximately 49.7 Gb (330.98 million reads) of Hi-C raw data were obtained (Supplementary Table 1).

For the transcriptome sequencing of seven tissues, total RNA was extracted from 21 individual samples across seven organs using the Plant RNA Kit. The RNA quality and concentration were determined using a Nanodrop spectrophotometer and 1% agarose gel electrophoresis. In order to construct cDNA Illumina libraries, Oligo(dT) magnetic beads were utilized to eliminate rRNA, tRNA, and microRNA from the high-quality RNA. The mRNA was then reverse transcribed to yield cDNA by reverse transcriptase, and poly (A) and adapters were introduced. Finally, the cDNA was amplified by PCR to generate the cDNA library, which was subsequently sequenced on the Illumina NovaSeq. 6000 platform. Approximately 5 Gb of raw data were obtained from each sample (Supplementary Table 2). For Nanopore transcriptome sequencing, we first mixed the total cDNA from the five different organs (roots, mature leaves, young leaves, female flowers, and stems) of Cushion willow. The main step was adding a rapid adapter to each cDNA sequence to construct a 1D full-length library. This 1D library was run on an Oxford Nanopore PromethION. Finally, a total of ~16 Gb (16 million reads) of clean full-length transcripts was obtained (Supplementary Table 1).

Haplotype genome assembly

We assembled the Cushion willow genome by combining PacBio single-molecule real-time long-read sequences, high-throughput chromatin conformation capture (Hi-C) sequences and Illumina short-read sequences from the National Center for Biotechnology Information (NCBI) Sequencing Read Archive (SRA) database (Project: PRJNA472210, Run: SRR9021434)18. First, we used HiFi data to initially assemble contigs using Hifiasm (v_0.16.1-r375)19. The Hi-C reads were compared with the contigs using Juicer20 and assisted in chromosome assembly using 3D-DNA (v_180922)21. We manually checked and adjusted the incorrect assemblies using Juicebox22. After two rounds of assisted assembly and manual correction, haplotype chromosome frames were generated. The gap in the chromosome was subsequently closed using TGS-GapCloser23 software based on the HiFi data (parameters:–ne–min_match 1000). Because of the chromosomes with unassembled telomeres or shorter telomeres, the HiFi reads were reused for comparison with the above-described assembled haplotype chromosomes by Minimap224. Hifiasm software was used to reassemble the sequences near the telomeres, and the contigs obtained from the assembly were compared with the chromosomes to extend the chromosome telomeres outward and assemble a more complete telomere sequence TTTAGGG (Fig. 1)25,26. The assembled genome was subjected to two rounds of error correction using the software Nextpolish27 with Illumina data. Redundans28 was used to remove redundant sequences (rRNA fragments and low average coverage fragments)25. Finally, the two haplotype genomes were fully resolved at the chromosome level. The chromosome number and orientation were renamed according to the chromosome assembly of the Cushion willow published previously15. GetOrganelle (v_1.7.5)29 was used to assemble the chloroplast and mitochondrial genomes. The sizes of the chloroplasts and mitochondria were 155,612 bp and 630,081 bp, respectively (Supplementary Table 3).

Fig. 1
figure 1

The schematic of assemble telomere to telomere.

Gene identification and functional annotation

Before annotating the genomic information, we prepared two datasets for annotation. (1) We used publicly available homologous protein sequences from 17 species of the Salicaceae (S. brachista, S. dunnii, S. purpurea, S. suchowensis, S. viminalis, Populus alba, P. alba var. pyramidalis, P. alba x P. glandulosa, P. davidiana x P. alba var. pyramidalis, P. deltoides, P. euphratica, P. ilicifolia, P. pruinosa, P. simonii, P. tremula, P. tremuloides, P. trichocarpa) and Arabidopsis thaliana to combine 278,011 non-redundant protein sequences as homologous protein evidence for gene annotation. (2) The Oxford Nanopore Technology (ONT) transcriptome data were aligned to the reference genome using Minimap224, followed by the use of Stringtie2 (v_2.2.1)30 to infer the transcript structure and assemble 85,880 transcripts. The Illumina RNA-seq data were downloaded from SRR7341541 in the SRA database18, which included 330,394 transcripts. Using PASA31, the above transcript data were merged into a set of transcript sequences containing 194,516 transcripts with structural annotations. Transposon elements were identified from scratch using EDTA32 (parameters:–sensitive 1–anno 1) to generate a TE library. Then, repeat regions in the genome were identified using RepeatMasker (http://www.repeatmasker.org/RepeatMasker/). A total of 1,138,238 repetitive sequences were identified, with a total length of 420,868,210 bp, accounting for 53.38% of the total genome length (Supplementary Table 4).

In the process of genome annotation, we undertook the following tasks: annotation, integration, and upgrading of genomes. Initially, we employed the MAKER233 annotation pipeline to preliminarily integrate three types of annotations: ab initio gene prediction, homologous protein, and transcript evidence. We then masked repetitive sequence regions in the genome using RepeatMasker (default parameters). The AUGUSTUS34 software was utilized for ab initio prediction of the coding genes. The BLASTN and TBLASTX methods were employed for expression gene annotation, which was based on transcript evidence alignment with the genome. Exonerate35 was used to polish the annotation of expressed genes. Subsequently, AUGUSTUS was run again using the previous prediction results of expressed genes. Alternative splicing and UTR sequences were added to the predicted genes according to transcript evidence. The EVidenceModeler (EVM)36 was then run to integrate the results of gene annotation by MAKER2 and transcript evidence by PASA. Finally, the consistency gene annotation integrated results by EVM underwent two rounds of iterative upgrades using PASA. Upon obtaining these results, we removed coding frames containing stop codons, those without start codons or stop codons, and filtered overly short sequences less than 150 bp. Moreover, we used tRNAScan-SE37 to annotate tRNAs, Barrnap (v_0.9) (https://github.com/tseemann/barrnap) to annotate rRNAs and RfamScan38 to annotate various noncoding RNAs. In summary, we obtained 53,238 protein-coding genes, 1,414 rRNAs, 1,301 tRNAs and 1,216 ncRNAs (Supplementary Table 5).

After the genes were identified, their functional and structural characteristics needed to be annotated. This step involved the annotation of gene function using the homologous gene database eggNOG-mapper39 for GO and KEGG annotation. Furthermore, we obtained structural information about the genes through motif and domain annotations. This was achieved by employing InterProScan40 to compare structural domain similarities based on sub-databases of InterPro, such as the PRINTS, Pfam, SMART, and PANTHER databases. To determine the best alignment of the genes, Diamond41 was used to align protein sequences with several protein databases (Swiss_Prot, TrEMBL, NR and Arabidopsis) using the parameters identity > 30% and E value < 1e-5.

Identification and expression analysis of allele genes

We identified alleles by the AlleleFinder pipeline (https://github.com/sc-zhang/AlleleFinder) based on the above high-quality haplotype genomes. Briefly, allele data were obtained from MCScanX42, GMAP43 and NCBI BLAST + based on two strategies: similarity and collinearity. First, using MCScanX, genes in syntenic regions were considered to be alleles. Then, the GMAP was used for further screening. Subsequently, alleles with more than 80% sequence similarity were recognized as pairs of alleles.

The RNA-seq data from the mature leaves of Cushion willow were used to analyze allele expression in both haplotypes. Initially, the adapters were trimmed, and the low-quality reads (<50 bp) were filtered out using Trimmomatic (v_0.39)44 software. Subsequently, the transcripts of the alleles were aligned to the homologous genome using HISAT2 (v_2.2.1)45 software. The fragments per kilobase of exon model per million mapped reads (FPKM) values were calculated to show allele expression levels via Stringtie2.

Transcript alignment and assembly

To obtain better analysis results, it is essential to filter the RNA-seq raw data by removing the adapters and trimming the low-quality reads via Trimmomatic. Then, through the use of the quality control software FastQC (v_0.11.9) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC (v_1.12)46, we obtained information on the clean data, including GC content and base sequence quality (Supplementary Table 2), of each sample. This information helped us to determine whether the results met the downstream analysis requirements.

The clean data from 21 samples were aligned, and the transcripts were assembled. First, HISAT2 was used to align the clean data with the reference genome (choosing the high-quality haploid genome A), thereby obtaining positional information about the transcripts. Then, using SAMtools (v_1.15)47, we obtained sorted binary bam files that revealed detailed information about the alignment results. Moreover, Stringtie2 was used to assemble the aligned reads to obtain the assembled transcripts of each sample. Overall, 28,587 non-redundant transcripts were assembled from seven organs.

It is necessary to conduct a repeated correlation test and eliminate samples with large differences in replicates to ensure the accuracy and value of the results. We computed the Pearson correlation coefficients among three biological replications based on the gene’s FPKM. The results revealed that the lowest Pearson’s correlation coefficient (r2) was 0.65, implying that all other correlation coefficients exceeded this value, this indicates robust replication among the biological replicates (Fig. 2).

Fig. 2
figure 2

Correlation analysis of 21 RNA-seq samples from S. brachista. The dark blue color signifies greater similarity and repeatability between the two replicates.

Determination of relative gene expression levels

The relative expression levels of each gene were calculated for various aspects, including high expression, organ-specific expression and differential expression. All calculations were based on the normalization of gene expression by FPKM values, which eliminated feature length and library size effects. We obtained gene expression tab files with FPKM values using the ‘-A’ parameter of Stringtie2. First, genes were screened and identified as expressed if FPKM > 0.3. Using a cutoff of FPKM > 20, distinctly high expression genes were detected in the root, stem, young leaf, mature leaf, monoecism flower, female flower and male flower of Cushion willow. Then, we subsequently compared and calculated the organ-specific expression of the genes via the R package Venn (v_1.10). Additionally, the differentially expressed genes (DEGs) were identified by the R package DESeq2 (v_1.32.0)48. Unlike FPKM, DESeq2 employs a negative binomial as its reference distribution and offers a unique normalization method called read count. Benjamini and Hochberg’s approach was used to adjust the resulting P values, thereby controlling the false discovery rate (FDR). The DEGs were designated as having an FDR < 0.01 and a log2 |fold change| >  = 1 according to DESeq2.

GO and KEGG enrichment analysis

To predict the functions and pathways of genes across varying expression levels (highly expressed, organ-specifically expressed, differentially expressed), we annotated these genes using the GO and KEGG background files, which consist of three columns of information on gene ID, term and function/pathway extracted from the above genomic annotations. Using the R package clusterProfiler (v_4.0.5)49, the functions and pathways of genes at different expression levels were annotated by the GO and KEGG background files, respectively (pvalue: 0.05, padjustmethod: Benjamini-Hochberg). Finally, we compiled lists of the most notable gene annotations.

Identification of AS events and APA sites

Based on the ONT full-length transcriptome data and the haplotype A genome of S. brachista, we performed an identification analysis of AS events and APA sites. First, we used the python script cdna_classifier.py in Pychopper2 (v_2.7.2) (https://github.com/epi2me-labs/pychopper) to trim and identify full-length transcripts. The full-length transcripts were self-corrected using the Flair (v_2.0.0)50 analysis process. Finally, the above full-length transcripts were utilized for the identification of variable AS events using the software package SUPPA2 (v_2.3)51 with default parameters. AS events were classified into seven categories: skipping exon (SE), mutually exclusive exon (MX), alternative 3’ splice site (A3), alternative 5’ splice site (A5), alternative first exon (AF), alternative last exon (AL) and retained intron (RI). APA sites were identified using LAPA (v_0.0.5)52.

Data Records

The relevant raw data reported in this paper have been deposited in the National Genomics Data Center (NGDC)53, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation under the BioProject accession number PRJCA022812 and PRJCA023075. The raw sequence data of genomic, including PacBio HiFi long-reads, Hi-C reads, have been deposited in the Genome Sequence Archive (GSA)54 at NGDC under the accession number CRA01464255. And raw sequence data of transcriptomic, which includes RNA-seq data from seven organs and Nanopore full-length transcript data, have also been deposited in the Genome Sequence Archive (GSA) at NGDC under the accession number CRA01460756. Genome assembly and annotation data has been deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHERDS0000000057. The allele genes data has been deposited in the figshare database58. The transcriptomic map data includes the FPKM of gene among seven tissues59, the GO and KEGG enrichment results of different expression levels60, AS events and APA sites61 can be store in the figshare database. The above data have been deposited in NCBI. The genomic raw data can be found in Sequence Read Archive (SRA) under the SRR3232960362 and SRR3232960463, while the assembly data has been deposited at GenBank under the accession JBLWMQ00000000064 and JBLWMR00000000065, and the genome’s annotation information is available in the figshare66. Additionally, the transcriptomic raw data have been deposited in the SRA under the accessions from SRR32340655 to SRR3234067667,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88, and annotation files, read count files, FPKM files, and other processed files can be accessed through the GEO accession GSE28961589.

Technical Validation

Evaluation of the assembled and the annotated genome

The assembled haplotype genomes contained two high-quality haploid genomes, haplotype A and haplotype B. Haplotype A had 19 chromosomes with a genomic size of 401 Mb, while Haplotype B consisted of 19 chromosomes with a genomic size of 386 Mb. The GC content of both haploid genomes was 34.88% (Table 1). The scaffold N50 lengths of haplotype A and haplotype B were 22.63 Mb and 21.85 Mb, respectively (Table 1). Only one gap was found on the chromosome 15 of haplotype A and haplotype B (Supplementary Table 3).

The short reads and long reads were mapped to the assembled genome to evaluate genomic coverage by BWA90 and Minimap2. The RNA-seq reads were aligned to the assembled genome using HISAT2. After filtering out the non-primary alignment reads, we obtained a higher map ratio and coverage of sequencing reads (Table 2). We calculated the heterozygosity and single base error rate using Illumina reads, yielding a heterozygosity rate of approximately 0.0079% and an error rate of about 9.6e-06 (Q50). Using Hifi, Illumina and ONT data, we valued the GC content and sequencing depth under different GC content. The findings revealed no significant GC bias (Supplementary Fig. 1). The evaluation of the assembled genome was performed by BUSCO (v_2.0.1)91 with 1,440 groups from the lineage dataset embryophyta_odb10. The assembled genome BUSCO results indicated that complete core genes (including single-copy and multiple-copy genes) accounted for 96.0%, while the missing genes accounted for 3.2% (Table 3). This is consistent with previous WGD studies, suggesting that most of the genome of the Cushion willow has experienced duplication throughout its evolutionary history15. These BUSCO results showed a relatively high degree of gene completeness.

Table 2 The map ratio and coverage of reads obtained by different sequencing methods.
Table 3 BUSCO evaluation of the assembled genome and the annotated proteome.

By mapping the Hi-C data to the final assembled genome with Juicer, we observed strong chromosomal clustering and assembly (Fig. 3). A comparison of sequences from the published genome and the two haploids was conducted using Minimap2. The results indicated that the chromosome order was identical in both instances (Supplementary Fig. 2). Characteristic sequences such as telomeres, tandem repeats, and 5S and 8-5.8-28S rDNAs were identified on the chromosomes (Supplementary Fig. 3). In summary, the two haplotypes exhibited comprehensive and well-assembled assemblies. The high-quality Cushion willow genome can serve as a reference for studies in the future.

Fig. 3
figure 3

Hi-C heatmap of the two final assembly haploids. The colors from yellow (low) to red (high) indicate the strength of the interaction.

A total of 52,715 protein-coding genes were identified across various databases, accounting for 99.02% of the total (Table 4). The assessment of the annotated proteome was conducted using BUSCO, revealing that 97.7% of the complete BUSCOs were present in the annotated proteome (Table 3).

Table 4 Summary of predicted gene annotations in the S. brachista genome.

To characterize the assembly of alleles in the two haplotype genomes, we performed allele genes by the AlleleFinder pipeline. In total, 23,744 allele genes were identified, comprising 1,459 paralog and 647 tandem genes (Supplementary Table 6). The 17,885 pair of allele genes expressed.

Evaluation of the assembled transcriptome

The clean data from 21 samples were aligned to haplotype genome A using HISAT2, yielding an average mapping rate of 86.63%. This suggests a significant proportion of the mapped clean reads (Supplementary Table 7). Following the use of the reference assembly, we obtained 28,587 non-redundant transcripts. The mean number of expressed genes per tissue was determined as 21,949 using the threshold FPKM > 0.3, with highly expressed genes comprising 22.62% of all expressed genes (FPKM > 20) (Fig. 4). Using Venn showed the organ-specific expression genes (Supplementary Fig. 4). Furthermore, we identified 17,387 DEGs across seven organs (Supplementary Fig. 5). We classified the expressed genes based on GO terms and KEGG pathways. Our analysis showed that genes with different expression levels were enriched in various metabolic pathways and functions, with a significant proportion of DEGs being associated with specific pathways and functions in both KEGG and GO (Fig. 5). Full-length transcript data can identify more accurately AS and APA sites, which are crucial for researching the regulatory mechanisms of gene expression. Using the ONT full-length transcripts, we discovered a total of 33,414 AS events, including RI, 19,336; SE, 5,785; A3, 3,932; A5, 2,942; AL, 330; MX, 355; and AF, 701 (Fig. 6a). Through the use of LAPA, we identified 36,634 poly (A) sites distributed across intergenic regions, 3’ UTRs, introns, and exons (Fig. 6b).

Fig. 4
figure 4

The expression levels of genes across seven organs of S. brachista. (a) The gene profiles of seven various organs (FPKM > 0.3). (b) Genes with high expression levels in seven different organs. (FPKM > 20).

Fig. 5
figure 5

Volcano plots of DEGs in each pairwise comparison of different flowering lines and young leaves vs. mature leaves. (a) Female vs. Male. (b) Female vs. Monoecism. (c) Male vs. Monoecism. (d) Young leaf vs. Mature leaf. Blue is up, red is down, and gray is not significant.

Fig. 6
figure 6

The types and numbers of AS events and APA sites. (a) Alternative splicing events. SE, skipping exon; MX, mutually exclusive exon; A3, alternative 3’ splice site; A5, alternative 5’ splice site; AF, alternative first exon; AL, alternative last exon; RI, retained intron. (b) APA, Alternative polyadenylation sites.

The aforementioned evidence collectively indicates that the transcriptomic map is both dependable and precise. The preliminary transcriptomic map of the Cushion willow offers invaluable resources for elucidating the adaptation strategies of this species to environments at extremely high altitudes.