Background & Summary

The Tyrrhenian tree frog, Hyla sarda (De Betta, 1853), is a relatively small (38–40 mm) anuran species belonging to the tree frog family Hylidae. H. sarda is endemic to the islands of Sardinia, Corsica, and the Tuscan archipelago (in the western Mediterranean Sea1), and is the sister species of the European tree frog (Hyla arborea). H. sarda is found in temperate forest and shrubland, and breeds annually from spring to summer in a variety of lentic freshwater environments such as ponds and pools. It remains near the breeding sites during most of the year2. Although considered a common and widespread species and listed as Least Concern by the IUCN, the species may be threatened by the reduction of natural habitat3. The species’ present-day distribution and distinctive biogeographic history make it an ideal model for investigating the phenotypic legacies of past biogeographical events and the underlying genomic mechanisms.

During the last glacial period, the Tyrrhenian tree frog underwent a range expansion from northern Sardinia to Corsica, promoted by the formation of a wide, temporary land bridge between these islands, and subsequently reached the Tuscan Archipelago from Corsica via jump dispersal4,5,6. This two-step range expansion offers a rare opportunity to explore the interplay between, and long-term impact of, neutral and non-neutral processes during historical range expansion events. Recent studies have dissected the patterns of phenotypic variation in H. sarda along the historical range expansion route, revealing considerable phenotypic evolution along both the south-to-north axis of the expansion and the route of the jump dispersal event. Specifically, H. sarda from the newly colonized area in Corsica exhibited larger body sizes than those in the source area in Sardinia, longer limbs, greater efficiency in jumping and adhesion, shyer and more prudent exploration behaviour7, a higher ability to change colour8, and different rates of physiological ageing9 and telomere dynamics10. In contrast, H. sarda sampled from Elba Island (i.e., the island colonised by jump dispersal) were bolder and performed less well in jumping and adhesion than individuals from Corsica11. Together, these findings suggest that during post-glacial range expansions, newly established populations could have been founded by non-random samples of the phenotypic makeup of the source populations, and that different forms of dispersal might imprint distinct directions on phenotypic evolution. However, the genomic architecture of the observed phenotypic diversity remains unexplored, hindering a mechanistic understanding of phenotypic evolution during range expansion.

Due to their distinct biological and evolutionary characteristics, assembling amphibian genomes presents unique challenges and opportunities compared to the assembly of other vertebrate genomes. Amphibians often have large and complex genomes, due primarily to a high proportion of repetitive elements12. Amongst the published anuran chromosome-level genome assemblies, genome spans vary from 988 Mb in the plains spadefoot toad (Spea bombifrons) to 10.2 Gb in the mountain yellow-legged frog (Rana muscosa) (Table S1)13. Despite the highly repetitive nature of these genomes, chromosome number variation among anurans is limited, with the majority of cytological studies and genome assemblies demonstrating a karyotype of 2n = 10–12 chromosomes13,14. Chromosomes are highly syntenic across species, as contemporary chromosome structure is derived from 13 ancestral chromosomes15. Due to drastic heterochiasmy, anuran sex-chromosome evolution is highly dynamic16, and although Hyla tree frogs typically have a homomorphic X/Y sex chromosome system, it has recently been reported that H. sarda has Z/W sex determination17. The generation of a high-quality genome assembly for Hyla sarda will assist future investigations into sex chromosome evolution in Hyla tree frogs. To date, 35 anuran chromosome-level genomes have been assembled13 (Table S1). No genome assemblies have yet been produced for the Hyla genus, although assemblies for the common tree frog (Hyla arborea) and Savigny’s tree frog (Hyla savignyi) are in progress.

This study presents a chromosome-level genome assembly (Fig. 1) of the Tyrrhenian tree frog, Hyla sarda, assembled as part of the Vertebrate Genomes Project (VGP)18 using PacBio HiFi sequencing, Bionano optical maps and Arima Hi-C technology. The assembled genome spans 4.1 Gb, with a scaffold N50 of 385 Mb, and comprises 13 chromosomes and 3,412 unplaced scaffolds (Table 1). The assembly is high-quality with a BUSCO completeness of 94.60% and a k-mer completeness of 98.29%. A total of 74.94% (3.1 Gb) of the genome comprises repetitive sequences (Table 2), and 22,847 protein-coding genes were predicted (Table 3), with a BUSCO completeness of 94.60% and an OMArk completeness of 93.74%. This high-quality assembly and accompanying annotation will serve as a valuable resource for investigating the species’ evolutionary history, uncovering the genetic signatures of phenotypic change during range expansion, and future population and conservation genomics studies.

Fig. 1

Genomic features of the Tyrrhenian tree frog (Hyla sarda). Circos plot of genome characteristics, showing (from the outside to the inside): (a) Chromosome ideograms; (b) GC content in 50 kb windows; (c) Protein-coding gene content in 200 kb windows; (d) DNA transposon content in 50 kb windows; (e) LTR content in 50 kb windows; (f) LINE content in 50 kb windows. See Tables 1–3 for more detailed statistics.

Table 1 Genome assembly statistics.
Table 2 Repeat annotation statistics.
Table 3 Gene annotation statistics.

Methods

Sample collection, extraction, and sequencing

The genome sample was obtained from an adult female collected in mainland Corsica in 2018. Biological tissue (hind leg muscle and whole brain) was flash-frozen and stored at −80 °C. Sampling procedures were performed under the approval of the Préfet de la Corse-du-Sud (#2A20180206002 and #2B20180206001). RNA-seq data derived from the brain tissue of nine individuals19 was used for genome annotation. For these individuals, whole brain tissue was stored in RNAprotect Tissue Reagent (Qiagen) at −20 °C.

For the long-read PacBio HiFi sequencing, high molecular weight (HMW) DNA was isolated from skeletal muscle using the MagAttract HMW DNA Kit (Qiagen 67563). A total of 124 mg of frozen tissue was disrupted with a Qiagen TissueRuptor II (Cat. No. 9002755). After tissue homogenization, lysis and subsequent DNA isolation were performed following the protocol described in the MagAttract HMW DNA Handbook (Manual Purification of High-Molecular-Weight Genomic DNA from Fresh or Frozen Tissue). The purified DNA was eluted in 100 µL of Qiagen Buffer AE. The DNA was quantified in triplicate using a Qubit 3 fluorometer (Invitrogen Qubit dsDNA Broad Range Assay cat no. Q32850). Prior to PacBio library preparation, the DNA was sheared using the Megaruptor 3 (Diagenode, Denville, NJ, USA). HiFi libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 following the manufacturer’s protocol (Pacific Biosciences, Menlo Park, CA, USA). Size selection was performed with a Pippin HT (Sage Science, Beverly, MA, USA). The libraries were sequenced on a PacBio Sequel IIe, with Sequencing Plate 2.0 and 8 M SMRT cells, generating a total of 129 Gbp of data (~31X coverage).

For the Bionano optical map libraries, HMW DNA was extracted from skeletal muscle using the Circulomics Nanobind Tissue Big DNA Kit (PacBio, CA, USA). The DNA was quantified using the Qubit 3 fluorometer (Invitrogen Qubit dsDNA Broad Range Assay cat no. Q32850) and fragment size was assessed with a pulsed field gel electrophoresis (Pippin Pulse, SAGE Science, Beverly, MA, USA). 750 ng DNA was labelled using direct labelling enzyme (DLE1) and the Bionano Prep Direct Label and Stain (DLS) protocol (document number 30206) and then imaged on a Bionano Saphyr instrument, generating 790 MiB of data (~130X coverage).

For the Hi-C libraries, 28 mg of skeletal muscle was used for the Arima Genomics crosslinking reaction following the manufacturer’s low input sample amount guidance (Arima High Coverage HiC Kit Document Part Number: A160162). Libraries were prepared using the Arima-HiC 2.0 kit (Arima Genomics, CA, USA). The library was sequenced on the Illumina NovaSeq 6000 platform with 150 bp paired-end reads, generating a total of 233 Gbp of data (~56X coverage).
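The coverage figures quoted above for the HiFi and Hi-C libraries can be approximated as total yield divided by genome span. The following is a minimal sketch of that arithmetic, assuming the final 4.15 Gb assembly span as the denominator; the function and variable names are illustrative and are not part of any pipeline used in this study.

```python
def approx_coverage(yield_gbp: float, genome_gb: float) -> float:
    """Approximate sequencing depth as total sequenced bases / genome span."""
    if genome_gb <= 0:
        raise ValueError("genome span must be positive")
    return yield_gbp / genome_gb


# Assumption: the final assembly span (4.15 Gb) is used as the genome size.
ASSEMBLY_SPAN_GB = 4.15

hifi_depth = approx_coverage(129, ASSEMBLY_SPAN_GB)  # PacBio HiFi yield in Gbp
hic_depth = approx_coverage(233, ASSEMBLY_SPAN_GB)   # Hi-C yield in Gbp

print(f"HiFi ~{hifi_depth:.0f}X, Hi-C ~{hic_depth:.0f}X")
```

Using the k-mer-based span estimate instead of the assembly span as the denominator would give slightly higher depths.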

For the genome annotation, RNA was extracted using the RNeasy Plus Kit (Qiagen), following the manufacturer’s instructions. RNA quality and concentration were evaluated using an Agilent Cary60 UV-vis and a Bioanalyzer Agilent 2100 (Agilent Technologies, Santa Clara, CA, USA). Library preparation and sequencing were performed at NovoGene (UK). Libraries were 150 bp paired-end sequenced on an Illumina NovaSeq 6000, generating a total of 986 Mbp of data. Further information on the RNA-seq samples can be found in Libro et al. 202219.

Genome assembly

The genome was assembled using the VGP v2.1 Galaxy pipeline20. Prior to assembly, we estimated the genome parameters with k-mer profiling, counting k-mers using Meryl21 and analysing the profile with GenomeScope v222. Using a k-mer size of 21 (ploidy = 2), the estimated haploid span was 3.75 Gb, with a heterozygosity of 1.08%. Notably, k-mer profiling revealed a highly repetitive genome, with a repeat length of 2.3 Gb. Direct C-value estimates for Hyla arborea indicate a C value of 4.76 Gb (2.4–7.0).
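The principle behind k-mer-based span estimation can be illustrated with a simplified calculation: the total number of k-mer instances in the reads, divided by the depth of the homozygous peak, approximates the haploid genome span. The sketch below is an assumption-laden toy, not the GenomeScope method itself (GenomeScope fits a mixture model to the full spectrum); the histogram values, the function name and the error cut-off are all hypothetical.

```python
def estimate_haploid_span(histogram, homozygous_peak, min_multiplicity=5):
    """Rough haploid span: total k-mer instances / homozygous peak depth.

    histogram maps k-mer multiplicity -> number of distinct k-mers.
    K-mers below min_multiplicity are treated as sequencing errors
    (a hypothetical cut-off chosen only for this illustration).
    """
    if homozygous_peak <= 0:
        raise ValueError("homozygous peak depth must be positive")
    total_instances = sum(
        multiplicity * count
        for multiplicity, count in histogram.items()
        if multiplicity >= min_multiplicity
    )
    return total_instances / homozygous_peak


# Toy spectrum (not real data): error k-mers at 1x, heterozygous peak at 15x,
# homozygous peak at 30x, and a small two-copy repeat peak at 60x.
toy_histogram = {1: 1000, 15: 200, 30: 800, 60: 50}

toy_span = estimate_haploid_span(toy_histogram, homozygous_peak=30)
print(toy_span)
```

Because repeats contribute k-mers at multiples of the homozygous depth, collapsed or under-sampled repeats lower the total and therefore the estimated span.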

HiFi sequences and Hi-C data were used as input to assemble phased contigs using HiFiasm v0.16.1 in Hi-C mode23. The resulting haplotypes were scaffolded using the Bionano and Hi-C contact data. Bionano scaffolding was performed using Bionano Solve v3.7.024 with default parameters and without contig breaking. Hi-C scaffolding was then performed on the Bionano scaffolds. Hi-C reads were aligned and prepared for scaffolding using the Arima mapping pipeline, which employs bwa mem25 and samtools26 for mapping and filtering. Scaffolding was performed using YaHS v1.227. PretextMap (github.com/wtsi-hpag/PretextMap) was used to visualise Hi-C contacts before and after scaffolding. Scaffolding with Bionano and Hi-C data improved the assembly N50 from 3.88 Mbp to 417.68 Mbp. The primary haplotype was manually curated using PretextView (github.com/sanger-tol/PretextView) to correct potential structural assembly errors, to manually join and align unplaced scaffolds, and to name chromosomes28. We obtained a final chromosome-level genome assembly of 4.15 Gb (Table 1), which was curated into 13 chromosomes (Fig. 2c) ranging from 620.7 Mb to 50.27 Mb29. The final assembly span (4.15 Gb) exceeds the k-mer estimate of 3.75 Gb, reflecting the genome’s high repeat content. Assessment of the k-mer copy-number distribution confirmed that H. sarda is diploid and revealed a diploid sequencing coverage of 30X and a haploid coverage of 15X (Fig. 2a). Assessment of the k-mer distribution between the primary and alternate haplotype assemblies revealed that diploid regions are shared by both assemblies and showed a high overlap between the haploid-coverage k-mers (Fig. 2b). Genomic features were visualized using Circos30.
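For readers less familiar with the N50 statistic reported above, the following minimal sketch shows how it is conventionally computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly span. The lengths used here are toy values, not those of the H. sarda assembly, and the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_span = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first sequence that takes the running total to >= 50% of the span
        if running_total >= half_span:
            return length


# Toy scaffold lengths (arbitrary units); total span = 300, half = 150
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)
print(toy_n50)  # 80
```

Joining contigs into scaffolds replaces many short sequences with fewer long ones, which is why scaffolding raises N50 without changing the total sequence content.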

Fig. 2

Genome assembly characteristics of the Tyrrhenian tree frog (Hyla sarda). (a) Copy number (CN) distribution plot of k-mer multiplicity, coloured by the number of times each k-mer is found in the assembly, where grey represents read-only k-mers, red represents one-copy k-mers, blue represents two-copy k-mers, green represents three-copy k-mers, purple represents four-copy k-mers, and orange represents five-and-more-copy k-mers. (b) Assembly (ASM) distribution plot of k-mer multiplicity, coloured according to which assembly contains the k-mers, where grey represents k-mers found only in the reads, red represents the primary haplotype assembly k-mers, blue represents the alternate haplotype assembly k-mers, and green represents the shared k-mers found in both assemblies. (c) PretextMap image of the 13 scaffolded pseudo-chromosomes of assembly 1 after curation (aHylSar1.pri.cur).

Transposable element annotation

For transposable element (TE) annotation, we used the EarlGrey TE annotation pipeline, which has been shown to increase TE consensus sequence length and resolve spurious overlapping and fragmented annotations31. EarlGrey v5.0.0 was run using RepeatModeler v2.0.632 and RepeatMasker v4.1.733 with NCBI/RMBLAST 2.10.0+ against the Dfam v3.834 Sarcopterygii partition and the Repbase RepeatMasker edition (version 20181026)35 libraries. Spurious TE annotations < 100 bp were removed. In total, 3.1 Gb of repetitive sequence was detected, constituting 74.94% of the H. sarda genome assembly. DNA transposons were the predominant class, spanning 941 Mb (22.71%), followed by LTR retrotransposons (562 Mb; 13.58%), LINEs (368 Mb; 8.88%), and SINEs (23 Mb; 0.56%) (Table 2). In addition, 867 Mb (20.94%) of repeats remained unclassified, indicating the presence of potentially novel repeat families that warrant further investigation in the context of the species’ phenotypic evolution.
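The five categories itemised above do not account for the full 74.94% repeat content. As a quick bookkeeping sketch, the snippet below sums the percentages quoted in the text and reports the remainder, which presumably corresponds to the other repeat categories listed in Table 2 (an assumption, since those categories are not itemised here). Variable names are illustrative.

```python
# Percentages of the assembly, as quoted in the text above
TOTAL_REPEAT_PCT = 74.94
class_pct = {
    "DNA transposons": 22.71,
    "LTR retrotransposons": 13.58,
    "LINEs": 8.88,
    "SINEs": 0.56,
    "Unclassified": 20.94,
}

itemised_pct = round(sum(class_pct.values()), 2)
# Assumption: the remainder belongs to repeat categories not itemised in the text
remainder_pct = round(TOTAL_REPEAT_PCT - itemised_pct, 2)

print(f"Itemised: {itemised_pct}%  Remainder: {remainder_pct}%")
```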

Gene prediction and functional annotation

Gene prediction and functional annotation were performed by the National Center for Biotechnology Information (NCBI) using the NCBI Eukaryotic Genome Annotation Pipeline36 on the aHylSar1.pri.cur assembly29. To assess annotation quality, BUSCO v4.1.437,38 analysis was performed using the tetrapoda_odb10 (n = 5,310) OrthoDB v1039 lineage dataset, and OMArk v0.3.040 with OMAmer v2.0.3 was run against the Tetrapoda clade (11,140 HOGs), using the longest isoform of each protein. In total, 102,483 genes and pseudogenes were predicted, including 22,847 protein-coding genes (56,007 fully-supported mRNAs) and 65,576 non-coding RNAs (see Table 3 for complete annotation statistics).

Mitogenome assembly and annotation

The mitogenome was assembled using MitoHiFi v241, with MitoFinder v1.4.242 for annotation. The mitogenome of Hyla annectans (KM271781.1)43 was used as the starting sequence. The resulting circularised mitogenome was 18,195 bp in length and contained the standard 37 vertebrate mitochondrial genes (13 protein-coding genes, 22 tRNAs, and 2 rRNAs).

Data Records

Raw sequencing and mapping data are available from the VGP GenomeArk repository (https://genomeark.github.io/genomeark-all/Hyla_sarda.html) and on the NCBI/ENA under BioProject: PRJNA1294985 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1294985).

The primary genome assembly (aHylSar1.hap1) is available at NCBI GenBank under the accession GCF_029499605.129. It is also available in Ensembl Rapid Release (https://rapid.ensembl.org/Hyla_sarda_GCA_029499605.1/Info/Index) and the UCSC Genome Browser (https://genome.ucsc.edu/h/GCF_029499605.1).

The alternate haplotype (aHylSar1.hap2) is available at NCBI GenBank under the accession GCA_029493135.144. It is also available in the UCSC Genome Browser (https://genome.ucsc.edu/h/GCA_029493135.1).

The mitochondrial genome sequence is available in NCBI GenBank, accession CM056048.145.

Technical Validation

Assembly quality and completeness were assessed at every step of the assembly process using Merqury v1.321, gfastats46, and BUSCO v5.3.037,38 with the tetrapoda_odb10 (n = 5,310) OrthoDB v1039 dataset. BUSCO analysis recovered 94.60% complete genes (92.80% single-copy, 1.70% duplicated), with 0.40% fragmented and 4.80% missing. The Merqury k-mer assessment revealed a QV score of 59.88 and a completeness of 98.29%. The majority (91%) of the assembled genome is contained within the 13 largest scaffolds, corresponding to the chromosomes confirmed by Hi-C analysis. The assembly contains 2,208 gaps. Telomeric repeat sequences, identified using tidk v0.2.3147, were found to be enriched on at least one end of 8 of the 13 chromosomes.
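The Merqury QV is a Phred-scaled consensus accuracy, QV = −10 log10(E), where E is the estimated per-base error rate. The sketch below simply inverts that relationship for the reported QV to show what it implies in more intuitive terms; the function name is illustrative.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV into an estimated per-base error probability."""
    return 10 ** (-qv / 10)


reported_qv = 59.88  # QV reported for this assembly
error_rate = qv_to_error_rate(reported_qv)
bases_per_error = 1 / error_rate

print(f"error rate ~{error_rate:.2e} (about one error per {bases_per_error:,.0f} bases)")
```

A QV close to 60 therefore corresponds to roughly one consensus error per million assembled bases.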

To assess annotation completeness, we also ran BUSCO (as above) in protein mode. We further assessed the annotation using OMArk v0.3.040 (Tetrapoda, n = 11,140 HOGs), identifying 93.74% complete HOGs (90.96% single-copy and 2.78% duplicated, of which 0.47% were expected and 2.32% unexpected) and 6.26% missing HOGs. With OMArk, the proteome showed 92.59% consistent lineage placement (9.84% partial hits, 2.07% fragmented). No contamination was identified.