Background & Summary

The cultivated strawberry (Fragaria × ananassa), a perennial plant belonging to the Rosaceae family, is an allo-octoploid species with a highly heterozygous genome that contributes to its genetic complexity and diverse phenotypic traits. This complexity poses a significant challenge for genetic research and breeding programs. Strawberries are a globally important crop, with the Food and Agriculture Organization of the United Nations (FAO) reporting worldwide production of 9.57 million tons in 2022 (https://www.fao.org/faostat/). In South Korea, strawberries are a significant economic crop, with a cultivation area of 5,745 ha and a production volume of 158,807 tons in 20221. The domestic production value of strawberries in South Korea is approximately USD 932 million, accounting for 14.7% of the total vegetable production value in the country2.

Among the various Korean strawberry cultivars, ‘Seolhyang’ (‘Akihime’ × ‘Red Pearl’), developed in 20053, dominates the South Korean market, occupying 82.1% of the total strawberry cultivation area in 20224. ‘Seolhyang’ is favored for its ease of cultivation; large fruit size; high yields5,6,7; and resistance to diseases such as angular leaf spot, anthracnose, and powdery mildew3,8,9,10. In an analysis of 45 representative Korean cultivars and genetic resources, ‘Seolhyang’ was distinguished by having the highest overall concentration of volatile organic compounds (VOCs)11. Various breeding programs have been initiated to harness the desirable traits of the elite cultivar ‘Seolhyang’. However, progress in precision breeding efforts has been hindered by limited genomic research on ‘Seolhyang’.

The availability of reference genomes has substantially advanced agricultural research and deepened the understanding of the genetic basis of plant traits. Such genomic insight reveals how artificial selection has shaped these traits over time and how genetic characteristics influence interactions within agricultural ecosystems, particularly with pathogens and insects12,13. Recently, the assembly of reference genomes in agriculture has improved considerably, particularly owing to the integration of third-generation sequencing technology14, which has enhanced the quality and completeness of plant reference genomes. High-throughput sequencing methods, such as next-generation sequencing (NGS), have enabled the generation of extensive genomic data. However, to overcome the limited contiguity of contigs and scaffolds assembled from short reads, long-read technologies, such as PacBio, BioNano, and Nanopore, have emerged as pivotal tools for third- and fourth-generation sequencing15,16. Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing generates long reads with an average length of 10 to 25 kb and an error rate of less than 0.5%. This combination of accuracy and read length positions HiFi sequencing as the primary data source for producing high-quality genome assemblies17,18. These advances have addressed some of the remaining challenges, particularly in assembling telomere-to-telomere (T2T), gap-free reference genomes. Notably, such high-quality genomes have been assembled for diploid and cultivated strawberries19,20,21,22,23, including ‘Hawaii 4’, ‘Benihoppe’, and ‘Florida Brilliance’, providing more reliable references among currently available genomic resources.

In this study, a high-quality genome assembly of the strawberry cultivar ‘Seolhyang’ was generated using approximately 100 Gb of HiFi sequencing data obtained from the PacBio Revio platform. Unlike previous assemblies of octoploid strawberry genomes, this assembly was completed without data from additional sequencing platforms, yet it yielded a high-quality reference genome comparable to those of ‘Royal Royce’ and ‘Florida Brilliance’. We completed a telomere-to-telomere genome assembly with a genome size of 797 Mb and a contig N50 of 27.04 Mb. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis detected 99.1% of conserved genes in the assembly. In addition, the average long terminal repeat (LTR) assembly index (LAI) was 17.28, reflecting high overall genome continuity; the LAI was computed from intact and total LTR retrotransposons identified using the Extensive de novo TE Annotator (EDTA) followed by LTR_retriever. Notably, we identified 50 of the possible 56 telomeres across 28 chromosomes. The ‘Seolhyang’ genome was annotated using RNA-Seq data representing various F. × ananassa tissues from the National Center for Biotechnology Information (NCBI) Sequence Read Archive, which resulted in 129,184 genes. Powdery mildew is a significant disease frequently observed in controlled cultivation environments, such as plastic greenhouses, and poses substantial challenges to strawberry production. ‘Seolhyang’ is well known for its resistance to powdery mildew. This study used the assembled genome to investigate the genetic basis of this resistance, focusing on the MLO (Mildew Locus O) genes, which have been reported to be associated with powdery mildew resistance. A total of 55 MLO genes were identified in the ‘Seolhyang’ genome. Their structures and domains were systematically compared with 20 MLO genes previously reported in diploid strawberries and 69 MLO genes identified in the octoploid strawberry ‘Camarosa’. These comparisons provide insights into the genetic characteristics underlying the powdery mildew resistance of ‘Seolhyang’, suggesting that its genome will be a promising genetic resource for identifying powdery mildew resistance genes and developing resistant cultivars.

Methods

Materials and DNA sequencing

The cultivated strawberry (F. × ananassa) cultivar ‘Seolhyang’ was used for genome sequencing. Young leaves were covered with black plastic bags and kept in a greenhouse for 14 d, after which the etiolated leaf tissue was harvested for DNA extraction. The leaves were frozen, and genomic DNA extraction and library preparation were performed by DNA Link (Seoul, South Korea). The single-molecule real-time (SMRT) sequencing library (SMRTbell library) for ‘Seolhyang’ was constructed using a PacBio DNA Template Prep Kit 3.0 (Pacific Biosciences, CA, USA), following PacBio’s standard protocol for target-size SMRTbell libraries. The library was sequenced on the PacBio Revio System (DNA Link, Seoul, South Korea).

De novo genome assembly and validation

Figure 1 illustrates the workflow for genome assembly and annotation implemented in this study. HiFi reads were used to produce a draft assembly with Hifiasm ver. 0.16.124, without sequencing the parents. Hifiasm was run according to the developer’s recommendations for heterozygous polyploid crops. The contigs were scaffolded and oriented against the reference genome of ‘Florida Brilliance’ (https://www.rosaceae.org/Analysis/14031408) by using RagTag25.

Fig. 1

Workflow implemented for ‘Seolhyang’ genome assembly and annotation.

Genome assembly statistics were calculated using QUAST version 5.0.26626. Merqury version 1.3 was used to measure the assembly consensus quality value (QV) and to evaluate the assembly based on efficient k-mer set operations27. The completeness of the genome assembly and of the protein-coding gene annotations was assessed using BUSCO28. The long terminal repeat (LTR) assembly index (LAI)29 for each subgenome was calculated using LTR_retriever30, together with the whole-genome transposable element (TE) annotations and intact LTR retrotransposons identified using EDTA31.
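
To make the contiguity statistics concrete, the following minimal Python sketch shows how N50 and L50 are defined from a list of contig lengths. It is an illustration only and not QUAST's implementation; the function name and the toy lengths are hypothetical.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total.
    L50 is the number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise AssertionError("unreachable for non-empty input")


# Toy lengths (hypothetical, not from the 'Seolhyang' assembly)
print(n50_l50([40, 30, 20, 10]))  # (30, 2)
```

Under this definition, an L50 of 15 means that the fifteen longest contigs together account for at least half of the assembled bases.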

Genome annotation

The TEs were annotated using EDTA v1.9.6 with default parameters31. A TE annotation library was generated in separate runs of EDTA. The TE regions of the haploid assembly were masked using RepeatMasker v4.1.1 supplied with this repeat library. Simple sequence repeats (SSRs), or microsatellites, were mined using the SSR Finder on the Genome Sequence Annotation Server v6.0 (GenSAS; https://www.gensas.org)32. To increase the accuracy of gene annotation, we generated a transcriptome assembly containing possible sets of transcripts from ‘Seolhyang’ and publicly available F. × ananassa expression data. Read alignments were converted to the binary alignment map (BAM) format by using SAMtools. Splice junctions for all merged RNA alignments were predicted and trimmed using Portcullis v1.2.233. Genome assemblies were annotated using BRAKER2. Functions of the predicted transcripts were annotated by alignment against the UniProtKB database35 using BlastP v2.2.2834.

Collinearity and synteny

Genomic synteny at the DNA level among F. vesca36, the F. × ananassa cultivars ‘Royal Royce’37 and ‘Florida Brilliance’ (https://www.rosaceae.org/Analysis/14031408), and ‘Seolhyang’ was visualized using D-GENIES38 with default parameters after alignment with minimap239. Candidate structural variations were explored using SyRI40.
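
Whole-genome alignments from minimap2 are commonly written in PAF format, in which each tab-separated record lists the query name (column 1), target name (column 6), and number of matching residues (column 10). As a rough sketch of how such output could be summarized before plotting, the Python snippet below assigns each query chromosome to the target sequence with the most matching bases. The function name and the sequence names in the toy records are hypothetical, and this is not part of the D-GENIES or SyRI workflows.

```python
from collections import defaultdict
from typing import Dict, Iterable


def best_target_per_query(paf_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query sequence to the target with the most matching bases."""
    matches = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        query, target = fields[0], fields[5]
        matches[query][target] += int(fields[9])  # column 10: residue matches
    return {q: max(t, key=t.get) for q, t in matches.items()}


# Toy PAF records (hypothetical names and coordinates)
toy_paf = [
    "Chr1A\t1000\t0\t900\t+\tFvb1\t1200\t0\t900\t850\t900\t60",
    "Chr1A\t1000\t900\t1000\t-\tFvb2\t1100\t10\t110\t80\t100\t30",
    "Chr2A\t800\t0\t800\t+\tFvb2\t1100\t200\t1000\t780\t800\t60",
]
print(best_target_per_query(toy_paf))  # {'Chr1A': 'Fvb1', 'Chr2A': 'Fvb2'}
```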

Technical Validation

Details of the sequencing data are shown in Table 1. A single SMRT Cell on the PacBio Revio platform generated 103.3 Gb of sequence in 9.1 M reads. The average read length was 17,668 bp with an N50 of 17,769 bp. The assembly contained 2,140 contigs with an N50 of 27.04 Mb, and fifteen contigs accounted for 50% of the total assembly (Table 2). The largest contig was 36.27 Mb and covered 99% of the corresponding chromosome length. Before scaffolding, BUSCO completeness was 99.1%, and LTR analysis gave an LAI score of 17.28, which falls in the reference-quality range (10 ≤ LAI < 20) of the LAI classification. Scaffolding produced a final genome size of 796.9 Mb. Notably, only 30 contigs were anchored in the final assembly for ‘Seolhyang’.

Table 1 Summary statistics of PacBio HiFi reads used for genome assembly of ‘Seolhyang’.
Table 2 Statistics of the ‘Seolhyang’ genome assembly.

Identification and characterization of MLO genes

Sequences with conserved MLO domains (cl03887) were retrieved from the Pfam database41. The physical locations of the MLO genes were obtained from the genome annotation file. Conserved motifs were searched using MEME42 and visualized together with gene structures using TBtools43.

Based on a multiple alignment of MLO proteins obtained with MUSCLE44, a phylogenetic tree was constructed using the maximum likelihood method in Geneious Prime. Collinear gene pairs were generated using MCScanX45. Each program was run with its default parameters according to its user instructions.

Data Records

The PacBio HiFi sequencing reads used for genome assembly have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number [PRJNA1148756] (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1148756)45.

The chromosome-level genome assembly has been deposited in GenBank under the accession number [JBKFVU000000000] (https://identifiers.org/ncbi/insdc.gca:JBKFVU000000000.1)46.

In addition, the gene annotation files and supplementary materials are available on FigShare (https://doi.org/10.6084/m9.figshare.26866807)47,48.

Collinearity between ‘Seolhyang’ and other published F. × ananassa genomes, namely ‘Florida Brilliance’ and ‘Royal Royce’, was confirmed. Alignments of the ‘Seolhyang’ assembly against FaFB1 (‘Florida Brilliance’; Fig. 2a) and FaRR1 (‘Royal Royce’; Fig. 2b) displayed a high degree of collinearity, although translocations on 1D were apparent in both comparisons. On the basis of these alignments, we applied to ‘Seolhyang’ the same chromosome nomenclature as that used for ‘Royal Royce’, reflecting the putative diploid origins of each respective subgenome (A, B, C, and D)37. Alignments of the ‘Seolhyang’ genome against the diploid F. vesca v4.036 showed a high degree of collinearity except for major translocations on 1A (Fig. 2c). Having confirmed the collinearity, we explored candidate structural variations among ‘Seolhyang’, ‘Florida Brilliance’, and ‘Royal Royce’ by using SyRI41 (Fig. 3). Of the four subgenomes of ‘Seolhyang’, only subgenome A showed higher sequence similarity with diploid F. vesca. Telomeric motifs (5’-TTTAGGG-3’) were searched for at the ends of each chromosome in the ‘Seolhyang’ assembly. Enrichment of telomeric motifs at the termini of the pseudo-chromosomes allowed the identification of 50 telomeres (Table 3). All pseudomolecules contained a telomere-rich region at one end at least. Overall, 22 pseudomolecules were potentially telomere-to-telomere; the exceptions were Chr 1B, 1C, 2A, 3C, 7A, and 7B.
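
A telomere search of this kind can be sketched as counting copies of the TTTAGGG motif, or its reverse complement CCCTAAA, within a window at each end of a pseudomolecule. The Python example below is an illustrative sketch rather than the procedure used in this study; the window size, the copy-number threshold, and the function name are assumptions chosen for the example, and the sequences are synthetic.

```python
from typing import Tuple

MOTIF = "TTTAGGG"    # telomeric repeat named in the text
REVCOMP = "CCCTAAA"  # its reverse complement


def telomere_ends(seq: str, window: int = 10_000,
                  min_copies: int = 50) -> Tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) for one sequence.

    window and min_copies are illustrative assumptions, not study values.
    """
    seq = seq.upper()

    def enriched(chunk: str) -> bool:
        # str.count tallies non-overlapping occurrences of each orientation
        return chunk.count(MOTIF) + chunk.count(REVCOMP) >= min_copies

    return enriched(seq[:window]), enriched(seq[-window:])


# Synthetic sequences (not real strawberry data)
both_ends = "CCCTAAA" * 60 + "ACGT" * 5000 + "TTTAGGG" * 60
one_end = "CCCTAAA" * 60 + "ACGT" * 5000
print(telomere_ends(both_ends))  # (True, True)
print(telomere_ends(one_end))    # (True, False)

# Tally consistent with the text: 22 pseudomolecules with telomeres at
# both ends and 6 with a telomere at one end give 50 of 56 possible.
print(22 * 2 + 6 * 1)  # 50
```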

Fig. 2

Dot plots of the ‘Seolhyang’ genome against F. × ananassa cv. Florida Brilliance (a), F. × ananassa cv. Royal Royce (b), and diploid F. vesca ver. 4.0 (c). Dot plots were produced using D-GENIES from minimap2 alignments.

Fig. 3

Collinearity analysis between ‘Seolhyang’ genome and other octoploid strawberry genomes, including ‘Florida Brilliance’ (FaFB1) and ‘Royal Royce’ (FaRR1).

Table 3 Information on telomeric motif enriched in the assembly for ‘Seolhyang’.

Genome annotation

In the ‘Seolhyang’ genome, repetitive sequences totaled 346.3 Mb, accounting for 43.46% of the genome. Most of the repeat sequence consisted of LTR TEs (25.4%; Table 4). For each chromosome, a region with dense repetitive sequences and a low gene density, presumed to be the centromere, was identified (Fig. 4). Genome sequences in which long TEs (>1 kb) were masked were used for gene prediction. De novo prediction of protein-coding genes in the genome assembly, supported by alignment of the RNA-Seq datasets to the assembly, yielded 151,558 transcripts. BUSCO analysis of the transcript assemblies revealed 2,275 complete core eudicot genes (97.8%; 3.5% single-copy and 94.3% duplicated), with 0.5% fragmented and 1.7% missing. In total, 129,184 genes remained in the ‘Seolhyang’ genome (Table 5).
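
The reported proportions can be cross-checked with simple arithmetic. The short Python sketch below recomputes the repeat fraction from the 346.3 Mb of repeats and the 796.9 Mb final genome size, confirms that the BUSCO categories sum to 100%, and derives the size of the BUSCO gene set implied by 2,275 complete genes at 97.8%. The helper name is hypothetical, and the implied set size is an inference from the rounded percentages rather than a value stated in the text.

```python
def percent(part: float, whole: float, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


# Repeat content relative to the final assembly size
repeat_mb, genome_mb = 346.3, 796.9
print(percent(repeat_mb, genome_mb))  # 43.46

# BUSCO categories (percent of the searched gene set)
single, duplicated, fragmented, missing = 3.5, 94.3, 0.5, 1.7
complete = round(single + duplicated, 1)
print(complete)                                    # 97.8
print(round(complete + fragmented + missing, 1))   # 100.0

# Gene set size implied by 2,275 complete genes at 97.8% (an inference)
print(round(2275 / (complete / 100)))  # 2326

# Average number of predicted transcripts per retained gene
print(round(151_558 / 129_184, 2))  # 1.17
```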

Table 4 Classification and distribution of repetitive DNA elements identified in ‘Seolhyang’ genome by EDTA pipeline.
Fig. 4

Distribution of transposable elements and genes in ‘Seolhyang’ genome. (a) length of assembled chromosomes, (b) distribution of DNA transposable elements, and (c) distribution of genes.

Table 5 Genes predicted in ‘Seolhyang’ genome.

Identification of FaMLOs in ‘Seolhyang’ genome assembly

A total of 55 FaMLO genes with MLO domains (cl03887) were identified. According to their homology to FveMLO genes from F. vesca, the FaMLO genes were named FaMLO01C to FaMLO20D (Fig. 5). Chromosome 3C carried the most FaMLO genes (five), whereas no FaMLO genes were found on chromosomes 4A, 4B, 4C, and 4D. The characteristics of the 55 deduced FaMLO proteins are shown in Table 6. Protein length ranged from 171 to 934 aa, with most proteins (53) between 400 and 600 aa. Only one FaMLO protein was shorter than 200 aa.
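
A length summary of this kind amounts to binning protein lengths into classes. The following Python sketch shows one way to do so; the bin boundaries mirror the ranges mentioned above, while the function names and the example lengths are hypothetical and are not the values in Table 6.

```python
from collections import Counter
from typing import Iterable


def length_class(aa: int) -> str:
    """Assign a protein length (in amino acids) to a size class."""
    if aa < 200:
        return "<200"
    if aa < 400:
        return "200-399"
    if aa <= 600:
        return "400-600"
    return ">600"


def summarize(lengths: Iterable[int]) -> Counter:
    """Count how many proteins fall into each size class."""
    return Counter(length_class(n) for n in lengths)


# Hypothetical lengths for illustration only
print(summarize([171, 450, 512, 598, 934]))
```

With the counts given in the text, one protein below 200 aa, 53 between 400 and 600 aa, and one at 934 aa together account for all 55 FaMLO proteins.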

Fig. 5

Chromosomal distribution and location of FaMLOs in ‘Seolhyang’ strawberry. Different colors indicate the chromosomes from different subgenomes of cultivated strawberry.

Table 6 The physical characteristics of FaMLO genes in ‘Seolhyang’ genome assembly.

Phylogenetic analysis of the FaMLO genes identified in the present study together with previously reported MLO genes classified all 55 FaMLO genes into seven clades (Fig. 6a). Clade 1 was the largest, with 14 members, followed by clade 7, with 11 members. To better elucidate the structural characteristics of the FaMLO genes, their CDS distributions were analyzed and visualized (Fig. 6b).

Fig. 6

Classification and characterization of FaMLOs identified in the genome assembly of ‘Seolhyang’. (a) Phylogenetic tree of MLOs from diploid and octoploid strawberries. Different branch colors represent the different clades. MLO family members from ‘Seolhyang’ identified in this study are marked with blue circles. The red stars and black rectangles indicate the previously reported MLOs in Fragaria vesca and F. × ananassa ‘Camarosa’, respectively. (b) Gene structure and conserved domain analysis of FaMLOs. The left part shows an unrooted tree of strawberry FaMLOs, the middle part shows the exon–intron distribution of FaMLOs, and the right part shows the distribution of conserved domains on each FaMLO protein.

Collinearity analysis between woodland strawberry (F. vesca) and the octoploid strawberry ‘Seolhyang’ was carried out to explore the evolutionary relationships of FaMLOs. In total, 55 FaMLOs and 17 FveMLOs formed collinear pairs, which are highlighted in Fig. 7.

Fig. 7

Collinearity analysis of MLO genes between the Fragaria vesca and Fragaria × ananassa genomes. Grey lines indicate collinear blocks between the two genomes, while the red lines represent collinear MLO gene pairs. The orange and green columns indicate the chromosomes from the Fragaria vesca and Fragaria × ananassa genomes, respectively. Chromosome numbers are displayed beside the chromosomes.