Background & Summary

Potato (Solanum tuberosum) is one of the most important and widely cultivated crops worldwide, with a significant role in global food security and agricultural research. Despite its significance, many studies still rely on the genome of the double monoploid (DM) clone of group Phureja DM1–3 516 R441,2 which lacks a substantial portion of the gene repertoire and variability found in cultivated tetraploid potato varieties3.

The potato cultivar Désirée is a red-skinned late-season potato variety, originally bred in the Netherlands in 1962 by crossing parent cultivars Urgenta and Depesche (Potato Pedigree Database)4. It is still cultivated due to its favourable agronomic traits, such as predictable yields and high tolerance to drought and some pathogens5. It has also been used in breeding programs, yet a genome assembly for the Désirée cultivar has not been available. In research, it has been propagated in tissue cultures, and used for genetic manipulation including gene overexpression6, gene silencing7, and Crispr-Cas gene editing8.

Although haplotype-resolved genome assemblies are becoming common in diploid organisms, the high heterozygosity rate, extensive repeat content, and the autopolyploid nature of cultivated potatoes still present significant challenges for generating high-quality haplotype-resolved assemblies. Currently, five haplotype-resolved genomes of autotetraploid potato cultivars are publicly available9,10,11,12,13 as well as several phased diploid genomes14,15,16. The recently published haplotype-resolved tetraploid potato assemblies rely on labour-intensive techniques such as single-pollen sequencing11 or the use of parental and crossing material12, which may not always be available.

Adding to existing publicly available genomes, we provide a reference quality (CRAQ overall AQI of 97.5) haplotype-resolved genome assembly of the tetraploid cultivar Désirée, assembled using solely PacBio HiFi and Illumina Hi-C data. Our assembly is accompanied by a comprehensive structural and functional gene annotation reaching 99.4% BUSCO completeness for Solanaceae, accompanied by orthology to DM genes. For the potato research community, we provide an online resource featuring a genome browser (https://desiree.nib.si) and downloadable genomic assembly and annotation files, providing a valuable tool for studies involving allele-specific expression or promoter analysis.

Methods

Sample preparation and sequencing

Leaves from 4-week old S. tuberosum cv. Désirée plants were collected and flash-frozen. High molecular weight genomic DNA (HMW gDNA) used for PacBio HiFi, Illumina and Oxford Nanopore Technologies (ONT) sequencing was extracted from the leaf tissues using a modified CTAB method17. The concentration and quality of the extracted DNA were assessed using a NanoDrop spectrophotometer.

PacBio HiFi

HMW gDNA was sent to National Genomics Infrastructure (NGI) Sweden for library preparation and sequencing on the PacBio Sequel II platform. We obtained 79.4 Gbp of raw data, consisting of 4.1 million reads.

Illumina Hi-C

Leaves from 4-week old S. tuberosum cv. Désirée plants were collected, flash-frozen in liquid nitrogen and ground using mortar and pestle. Hi-C library prep using the Omni-C kit (Dovetail Genomics) and sequencing were performed on an Illumina NovaSeq 6000 platform by NGI Sweden. Sequencing generated 2018.4 million paired-end (2 × 150 bp) reads.

ONT

The HMW gDNA was used for ONT DNA library prep using the SQK-LSK110 kit and sequenced on a MinION using the FLO-MIN106 flow cell. Reads were basecalled using Dorado (v0.7.2) with the model dna_r9.4.1_e8_sup@v3.3 which generated 5.8 Gbp. The reads with methylation-related tags were converted to bedMethyl format using modkit (v0.4.1).

Illumina short reads

Illumina short-read library was constructed from the HMW gDNA and sequenced on Illumina NextSeq 2000 by ELIXIR Slovenia node to generate 150 bp paired-end reads. The short-read sequencing generated approximately 138 Gbp of raw data, consisting of 460.1 million paired-end (2 × 150 bp) reads.

Genome size and heterozygosity estimation

The genome characteristics of S. tuberosum cv. Désirée, including genome size, heterozygosity, and repeat content, were estimated using Illumina short-read data and a k-mer based approach. A 21-mer frequency distribution was generated with Jellyfish (v2.2.10), and the genome’s key features were inferred using GenomeScope2 (v2.0). The haploid genome size was estimated at 669.6 Mbp, with a heterozygosity rate estimated at 3.8–5.7%.

De novo genome assembly, Hi-C scaffolding and quality assessment

PacBio HiFi and Illumina Hi-C reads18 were initially assembled into four sets of haplotype-resolved contigs using Hifiasm (v0.19.8-r603)19,20,21. Hifiasm primary unitigs were searched against DM genome assembly with blastn (v2.5.0)22 and best matches were visualised on Graphical Fragment Assembly with Bandage (v0.8.1, Fig. 1a)23. We performed quality control of the contigs using Merqury (v1.3, Fig. 1b)24 k-mer spectra and BUSCO completeness scores (v5.4.7, solanales_odb10 dataset)25. The length of haplotype draft assemblies ranged from 761.6 Mbp to 888.4 Mbp with contig N50 sizes ranging from 7.0 Mbp to 13.7 Mbp (Table 1).

Fig. 1
figure 1

General characteristics of Désirée genome assembly (a) Assembly graph of primary unitigs coloured by best match to DM chromosomes (also designated with numbers on the graph). (b) Merqury k-mer spectra for initial contigs and scaffolded chromosomes. The k = 21 was used. K-mers are categorized as read-only (grey), unique (red), and shared (blue, green, purple, orange). Peaks corresponding to higher multiplicities indicate the presence of highly repeated k-mers. (c) Dot plot comparing cv. Désirée chromosome-anchored contigs with DM v8.1 chromosomes. The colour designates contig identity. (d) Genomic synteny of cv. Désirée haplotype-resolved assembly.

Table 1 Summary of the four haplotypes of the Désirée genome assembly.

Contigs identified as contaminants were removed based on blastn (v0.8.1) searches against a custom-built contaminant database, which includes Solanum plastid and mitochondrial sequences as well as bacterial sequences all downloaded from NCBI RefSeq release 218 (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/).

Decontaminated scaffolds were anchored to chromosomes by mapping Hi-C reads to each haplotype set separately following the manufacturer’s recommended pipeline for Omni-C data (https://omni-c.readthedocs.io). Briefly, Hi-C reads were mapped using BWA-MEM (v0.7.17-r1188)26 then the mappings were parsed with pairtools (v0.3.0)27 followed by samtools (v1.3.1)28 to identify and extract valid pairs. Valid pairs were used to anchor and orient scaffolds into chromosomes using YaHS (v1.2a.1)29 and Juicebox Assembly Tools (v2.17.00)30,31.

Chromosomes 11 and 12 of haplotype 4 lacked ~20 Mbp and ~30 Mbp part of the pericentromeric region, respectively, and haplotype 1 contained two additional unplaced scaffolds (scaffold_22 and scaffold_23). Alignment of these scaffolds to reference genome (DM v6.1) and inspection of Hi-C contacts suggested that these scaffolds are the missing regions of chromosomes 11 and 12 in haplotype 4. Therefore, we remapped Hi-C reads and incorporated these two scaffolds in haplotype 4 using Juicebox Assembly Tools (v2.17.00).

The final scaffolded assembly size amounts to 3.3 Gbp, with individual haplotypes ranging between 762 and 888 Mb. As expected, one haplotype is highly similar to the DM haplotype, whereas other haplotypes can be more dissimilar (Fig. 1c). A comparison of Merqury k-mer spectra between the initial contigs and the scaffolded chromosomes (Fig. 1a) reveals that many apparent duplications in the contigs are resolved during scaffolding. A small proportion of sequences remains missing from the chromosomes and those can be found in the whole genome FASTA.

The haplotype assemblies were sequentially aligned using minimap2 (v2.28) and analyzed with SyRi (1.7.0) to identify syntenic regions and structural rearrangements which were visualized using plotsr (v1.1.1, Fig. 1d).

Genome annotation

Repeat elements in the S. tuberosum cv. Désirée genome were identified using the Extensive de novo TE Annotator (EDTA, v2.2.1)32. Repetitive sequences cover 489–534 Mbp per haplotype, representing more than 70% of the genome (Table 2).

Table 2 Summary of genome annotations for each haplotype.

The prediction of protein-coding genes in the assembled S. tuberosum cv. Désirée was determined using five complementary approaches: de novo, homology-based, transcriptome-based, deep-learning, and reference-based predictions (Fig. 2). The annotation pipeline was run for each haplotype independently.

Fig. 2
figure 2

Workflow overview of S. tuberosum cv. Désirée genome annotation.

For transcriptome-based prediction, two methods were applied for short reads33,34,35,36,37,38,39,40 and Iso-Seq reads41, respectively. Short reads from multiple tissues were aligned to each haplotype using STAR (2.7.10a)42, and transcripts were assembled with StringTie2 (v2.2.1)43, followed by Portcullis (v1.2.4)44 for junction validation. Iso-Seq reads from five S. tuberosum cultivars were mapped to both haplotypes using minimap2 (v2.28)45, and transcripts were generated using IsoQuant (v3.3.1)46 and TAMA Collapse (tc_version_date_2023_03_28)47.

BRAKER3 (v3.0.8)48 was used in ETP mode to predict gene models by integrating de novo, homology-based, and transcriptome-based predictions. Repeat masking of the assembly was performed with RepeatMasker (v4.1.2), using EDTA annotations. Protein sequences from OrthoDB (green plant orthologs) were provided as evidence, and short-read STAR alignments with invalid junctions removed were included.

Helixer (v0.3.3)49,50 was used for deep-learning-based gene prediction via its web interface (https://www.plabipd.de/helixer_main.html). Gene models from the S. tuberosum reference genome (DM v6.1, UniTato annotation) were transferred to the Désirée assembly using Liftoff (v1.6.3)51. All five transcript or gene model sets were consolidated using Mikado (v2.3.4)52 and UniProt plants database (review version 2024_04_22)53 to generate a non-redundant set of transcripts. Protein-coding gene completeness was assessed using BUSCO (Tables 2, v5.4.7, solanales_odb10 dataset) and OMArk (v0.3.0, omamer v2.0.2)54. The quality of protein-coding gene annotations was assessed using PSAURON (v 1.0.4)55 and results were added to the GFF3 annotation file.

Transcriptomic data used for gene annotation was downloaded from public repositories: SRA under accessions SRP54834434, SRP54537635, SRP31582741, SRP35813033, SRP55684836 and SRP54787537; the Gene Expression Omnibus (GEO) under accession GSE23202839; and the National Genomics Data Center (NGDC) under accession CRA00601238. Existing gene models used in the gene annotation pipeline were downloaded from https://unitato.nib.si and https://spuddb.uga.edu.

The predicted protein-coding genes were functionally annotated using EggNOG Mapper (v2.1.11)56 with the EggNOG database (version 5.0.2)57 for the Viridiplantae subset. This included categories such as gene names, Gene Ontologies (GOs), enzyme functions (EC), and KEGG pathways, reactions, and modules, along with CAZy families, PFAM domains, and more. Additionally, functional land-plant protein annotations were predicted using Mercator4 (v7)58 via the web platform (https://www.plabipd.de/mercator_main.html). Annotations from EggNOG and Mercator4 were combined into the final GFF3 annotation file.

Orthologous groups between haplotypes and UniTato genes were identified using OrthoFinder (v2.5.5)59 using default setting. Across haplotypes, 55.3% of orthogroups contained genes from all four haplotypes, 22.9% from three haplotypes, 19.2% from two haplotypes, and 2.7% from a single haplotype. This is in line with the haplotype-resolved potato genome assemblies of cv. Atlantic10 and cv. Otava11. When comparing the Désirée annotation to UniTato, 17.24% of genes were specific to the Désirée annotation.

Data Records

The raw sequencing data, including Illumina Hi-C, Illumina paired-end, PacBio HiFi, and ONT reads, have been deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject number PRJNA118502818. Plastid, mitochondrial and bacterial sequences used for removal of contaminant contigs were downloaded from NCBI RefSeq release 218. The genome assemblies of the four haplotypes have been submitted to NCBI GenBank under the umbrella BioProject accession PRJNA121701160,61,62,63,64. The assembled genome, including annotations, methylation profile and identified orthologs, is hosted in a Zenodo repository under https://doi.org/10.5281/zenodo.1528255365 and is also accessible via an interactive genome browser at https://desiree.nib.si.

Technical Validation

We assessed the assembly quality and completeness using DNA sequencing read mapping, CRAQ, BUSCO analysis, and Merqury k-mer based evaluation. Illumina reads were mapped with BWA (v0.7.17), while PacBio and ONT reads were aligned using minimap2 (v2.28). Mapping rates were 99.90%, 100.00%, and 99.74% for Illumina paired-end, PacBio, and ONT reads, respectively. CRAQ (v1.0.9)66 analysis of PacBio and Illumina mappings yielded a regional AQI of 96.3 and an overall AQI of 97.5, classifying the assembly as reference quality (AQI > 90). Assembly completeness was assessed with BUSCO (v5.4.7) using the solanales_odb10 lineage database, identifying 5930 (99.6%) of the 5950 BUSCO orthologous groups in both the whole genome and chromosome-only assemblies (Table 1). Merqury (v1.3) analysis, using a Meryl (v1.3) database constructed from Illumina reads, estimated genome completeness at 98.57% for the whole genome and 95.73% for the chromosomes. The estimated QV values were 54.30 and 58.53 for the whole genome and chromosomes, respectively.

Completeness of gene annotation was assessed using OMArk (v0.3.0, omamer v2.0.2), BUSCO (v5.4.7) and Mercator4 (v7). OMArk analysis demonstrated that our annotation captured 94.1%–94.6% of Hierarchical Orthologous Groups (HOGs) per haplotype, with duplication rates ranging from 11.5% to 11.9% (Fig. 3a). When combining genes from all haplotypes, the proportion of complete HOGs reaches 99.3%, meaning that not all conserved genes are present in all haplotypes. Similarly, BUSCO analysis reported a haplotype completeness range of 93.3%–95.4% (Table 2), while the whole genome annotation achieved 99.4% completeness. Protein classification via Mercator4 revealed that 93.9%–94.6% of Mercator bins were occupied per haplotype, increasing to 97.5% when combining all proteins (Table 2). As expected, the Mercator bin with the largest proportion of missing proteins was associated with clade-specific metabolism (Fig. 3b). Additionally, the classified proteins showed no significant deviation from the median protein length, confirming consistency in annotation quality (Fig. 3c).

Fig. 3
figure 3

Validation of gene annotation. (a) OMArk quality assessment showing consistency, completeness and count of proteins across all four haplotypes. (b) Histogram showing the percentage of Mercator4 functional bins occupied by the Désirée proteins. (c) Histogram displaying the distribution of proteins grouped by their percentage deviation from the median protein length.

Usage Notes

The presented Désirée genome assembly is of high contiguity, completeness and phasing quality and presents a valuable resource for haplotype-aware transcriptomics, proteomics and epigenomics analyses. The transfer of UniTato annotations67 provides translation of gene identifiers from the DM to the Désirée genome. The RNA-seq datasets used to supplement gene model annotation are predominantly from mature leaf and root tissue, thus genes specifically expressed in other tissue and developmental stages may not be fully captured in the current annotation.

The genome was produced from a plant propagated in tissue culture for over a decade. A recent pangenome study3 found that in vitro propagated plants of the Solanum section Petota have greater numbers of TEs in their genomes. While this seems to hold for LTR elements and DNA transposons in the Désirée genome, overall TE expansion is not evident. Examining the DNA methylation profile available in the Désirée genome browser might provide more insight into specific transposable element expansion in this cultivar.

Recently, efforts were made to generate potato pangenomes3,10. However, the number of included phased tetraploid genomes is still limited. Including Désirée and more phased tetraploid genomes will improve the completeness of potato pangenome. This will bridge knowledge gaps in potato genomics and give potato breeders a powerful toolkit for developing more resilient and productive cultivars.