Background & Summary

Fungi represent one of the most diverse and ecologically significant kingdoms of life, with an estimated diversity of 2 to 5 million species globally1,2. However, only approximately 150,000 species have been described, representing less than 5% of the total estimated diversity3. The definition of a fungal species has been debated for decades, and there is no consensus on how to delineate species4. With the introduction of molecular typing and DNA barcoding techniques, genomic regions such as the internal transcribed spacers (ITS) and the 18S rRNA gene have been used extensively to type fungi and to assist in species definition. Although this approach has been useful for decades, single-gene barcoding has become insufficient to delineate species at high resolution as the number of described species has grown. Molecular mycologists have therefore turned to multi-locus typing and even whole-genome sequencing to achieve higher taxonomic resolution, which can extend beyond the species level to the subspecies level4. This trend is reflected in the increasing number of complete and draft fungal genomes deposited in public databases5.

Fungi, like most eukaryotes, carry extrachromosomal DNA molecules, such as mitochondrial DNA or plasmids6. These extrachromosomal DNA molecules are present in multiple copies, exceeding the copy number of nuclear DNA by an order of magnitude or more. Mitochondrial genomes, with their compact size, high copy number, and conserved gene content, have been used to elucidate fungal phylogeny, population genetics, and evolutionary dynamics7,8,9,10. However, they are often ignored or deliberately removed from downstream analyses when fungi are whole-genome sequenced11, or are erroneously assembled as part of the nuclear genome12.

As of June 2024, the NCBI nucleotide database hosted 3,774 complete fungal mitochondrial genomes representing 1,114 species, the majority belonging to the phyla Ascomycota and Basidiomycota, with fewer representatives from the other phyla. In contrast, the SRA database held 99,472 fungal genomic sequencing records from 4,994 species, an under-utilized resource that could potentially be used for mitochondrial genome assembly. To exploit this gap between the available fungal mitochondrial genomes and the public SRA records, we undertook a large-scale effort to assemble complete fungal mitochondrial genomes de novo, aiming to extend the known mitochondrial diversity to new groups. To achieve this, we employed a fungal mitochondrial genome recovery workflow that involves quality control, de novo assembly from paired-end (PE) short reads, and extraction of mitochondrial genomes (Methods).

In conclusion, this dataset significantly expands the known diversity of fungal mitochondrial genomes by leveraging underutilized public sequencing data: we assembled complete mitochondrial genomes from 2,695 species, nearly tripling the previously available records. The dataset introduces the first mitochondrial genomes for numerous taxonomic groups, including 15 classes, 64 orders, 178 families, and 544 genera, with particularly notable expansions in understudied phyla such as Mucoromycota (11-fold) and Zoopagomycota (8-fold). By filling critical gaps in fungal mitochondrial genomics, this resource enables deeper phylogenetic, ecological, and evolutionary studies, and provides a foundation for future research on fungal biodiversity, host-pathogen interactions, and mitochondrial evolution. The workflow and findings underscore the value of mining public sequencing data to unlock hidden genomic diversity, and offer a model for similar large-scale efforts in other eukaryotic lineages.

Methods

Data collection

Fungal mitochondrial genomes

To retrieve the already available complete mitochondrial genomes of the Kingdom Fungi, we used the NCBI online platform querying for “mitochondrion[Title] AND (“complete genome” OR “complete sequence”)” then applied the following filters: i) species == “fungi”; ii) molecule type == “genomic DNA”; iii) sequence type == “Nucleotide”; and iv) genetic compartment == “Mitochondrion”. The resulting records were exported to a sequence file in FASTA format.

Fungal SRA records

To identify the available fungal short-read SRA records, we queried for “fungi” in the SRA database, then applied the following filters: i) source == “DNA”; ii) library layout == “paired”; iii) platform == “Illumina”; iv) strategy == “genome”; and v) file type == “fastq”. The resulting records were exported as a TSV file including all associated metadata.

To exclude the SRA records from species with already existing complete mitochondrial genomes, we mapped the accession numbers of the fungal mitochondrial genomes and the SRA accessions to the NCBI taxonomy database13, using the R package “taxonomizr” v0.10.7 (https://github.com/sherrillmix/taxonomizr). For each mitochondrial genome accession number, we used the function “accessionToTaxa” to get its taxonomic ID (taxID), then we used the function “getTaxonomy” to get the taxonomic lineage. We considered only the major taxonomic ranks i.e., kingdom, phylum, class, order, family, genus, and species. For the SRA records, we applied the function “getTaxonomy” directly on the taxID associated with each BioSample.

All SRA records from species with a representative mitochondrial genome already available, either in RefSeq or in the International Nucleotide Sequence Database Collaboration (INSDC), were then excluded, and the remaining SRA accessions were used for de novo assembly.
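As an illustration only, the exclusion step can be sketched in Python as a set difference at the species rank. The record fields (`run`, `species`), the placeholder names, and the function name are hypothetical; matching on normalized species names is an assumption of this sketch, not a description of the R code used in this study.

```python
def select_sra_for_assembly(sra_records, mito_species):
    """Keep SRA records whose species has no complete mitochondrial genome yet.

    sra_records  : list of dicts with (hypothetical) keys "run" and "species"
    mito_species : iterable of species names that already have a mitogenome
    """
    # Normalize names so that case or stray whitespace does not hide a match.
    covered = {name.strip().lower() for name in mito_species}
    return [
        rec for rec in sra_records
        if rec["species"].strip().lower() not in covered
    ]


# Placeholder data, not real accessions or taxa.
sra_records = [
    {"run": "RUN_A", "species": "Species alpha"},
    {"run": "RUN_B", "species": "Species beta"},
    {"run": "RUN_C", "species": "Species gamma"},
]
mito_species = ["Species beta"]

selected = select_sra_for_assembly(sra_records, mito_species)
print([rec["run"] for rec in selected])
```

Records whose taxID does not resolve to a species would need separate handling, which the sketch does not attempt.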

Mitochondrial genome assembly workflow

For each SRA record, we downloaded the raw sequencing data in fastq format using the tool “fasterq-dump” v3.0.1 (https://github.com/ncbi/sra-tools), then trimmed the adapters and low-quality sequences using “fastp” v0.23.4 (ref. 14) with the default parameters. The resulting files were rarefied to 5 million paired-end reads using “seqtk” v1.4 (r122) (https://github.com/lh3/seqtk), followed by de novo assembly using “SPAdes” v4.0.1 (ref. 15) with the options “--gfa11 -k 21,33,55,77,89 --isolate”. To extract the mitochondrial genomes from the resulting assembly graphs, we used the tool “GetOrganelle” v1.7.7.1 (ref. 16), specifying the search database “fungus_mt” to confine the search to fungal mitochondrial genomes. The tool labels the extracted mitochondrial genomes either as complete and circular or as scaffolded. For the scaffolded genomes, we repeated the rarefaction and assembly using 2 million paired-end reads (Fig. 1).
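The two-depth retry described above can be sketched as follows. This is a minimal illustration of the control flow only: `assemble_at_depth` is a hypothetical callable standing in for the rarefaction, assembly, and extraction steps, and the status strings are assumptions. The text does not state what happens when the second attempt is still scaffolded, so the sketch simply reports the last status.

```python
from typing import Callable, Optional, Sequence, Tuple


def assemble_with_fallback(
    assemble_at_depth: Callable[[int], str],
    depths: Sequence[int] = (5_000_000, 2_000_000),
) -> Tuple[Optional[int], str]:
    """Try each read depth in turn and stop at the first circular genome.

    assemble_at_depth is a hypothetical stand-in for subsampling to `depth`
    read pairs, assembling, and extracting the mitochondrial genome; it is
    assumed to return "circular" or "scaffold".
    """
    status = "not_run"
    for depth in depths:
        status = assemble_at_depth(depth)
        if status == "circular":
            return depth, status
    # No depth produced a circular genome; report the last status seen.
    return None, status


def fake_assembler(depth: int) -> str:
    """Toy stand-in: pretends that only the 2-million-pair run circularizes."""
    return "circular" if depth == 2_000_000 else "scaffold"


result = assemble_with_fallback(fake_assembler)
print(result)
```

Passing the assembly step in as a callable keeps the retry logic independent of how the external tools are invoked.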

Fig. 1: Overview of fungal mitochondrial genome de novo assembly workflow.

Workflow for assembling fungal mitochondrial genomes from publicly available paired-end short-read whole-genome sequencing (WGS) data (For details, please refer to the Methods).

Mitochondrial genome annotation and phylogeny

The assembled complete mitochondrial genomes, as well as those retrieved from RefSeq and the INSDC, were annotated using “MFannot” v1.37 (ref. 17) with their corresponding genetic code.

Phylogenetic analysis was conducted using the amino acid sequences of the mitochondrial protein-coding genes (PCGs) cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, atp6, atp8, and atp9. The sequences for each gene were aligned using “MAFFT” v7.505 (ref. 18), and poorly aligned regions were trimmed using “trimAl” v1.4.rev22 (ref. 19), followed by concatenation and alignment assessment using “AMAS” (https://github.com/marekborowiec/AMAS) (ref. 20). A maximum-likelihood phylogeny was inferred with “FastTreeMP” v2.1.11 (ref. 21) under the LG + Γ model (Le & Gascuel amino acid substitution matrix with gamma-distributed rate heterogeneity). Branch lengths represent the number of substitutions per site, and node support values were estimated using FastTree’s approximate likelihood ratio test (SH-like local support). The tree was rooted with the phylum Rozellomycota (including 2 genomes) using the R package “castor”, and visualized and annotated using the R packages “ggtree” and “ggtreeExtra”.
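The concatenation step can be illustrated with a short, self-contained Python sketch that joins per-gene alignments into one supermatrix and records the partition boundaries. It is not the tool used in this study: the function name and toy sequences are hypothetical, and padding absent genes with gap characters is an assumption of the sketch (a concatenation tool may encode missing data differently).

```python
def concatenate_alignments(gene_alignments, gene_order):
    """Concatenate per-gene alignments into a supermatrix.

    gene_alignments : {gene: {taxon: aligned_sequence}}
    gene_order      : list of gene names giving the column order
    Returns (supermatrix, partitions), where partitions holds
    (gene, first_column, last_column) with 1-based inclusive columns.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in gene_order:
        aln = gene_alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon lacking this gene is padded with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy alignments with placeholder taxa and sequences.
alignments = {
    "cox1": {"taxonA": "MFA-", "taxonB": "MFAL"},
    "cob": {"taxonA": "MRK"},
}
matrix, parts = concatenate_alignments(alignments, ["cox1", "cob"])
print(matrix)
print(parts)
```

Keeping the partition table alongside the matrix makes it possible to trace every column of the concatenated alignment back to its gene.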

Data Records

The nucleotide sequence data reported here are available in the Third Party Annotation Section of the DDBJ/ENA/GenBank databases under the BioProject PRJNA1367877 and the accession numbers TPA: BK072095-BK074789 (ref. 22).

Additionally, the following data files are available in Figshare23:

  1. “Assembled_fungal_mitochondrial_genomes.fasta”: Combined FASTA file including all newly assembled mitochondrial genomes.

  2. “fungal_mitochondrial_genomes_phylogeny.tre”: Maximum-likelihood phylogeny of all fungal mitochondrial genomes. The analysis includes all complete fungal mitochondrial genomes assembled in this study as well as genomes from the RefSeq and the INSDC databases.

  3. “All_fungal_mitochondrial_genomes_accessions.tsv”: Table including NCBI accession numbers and taxonomic information of the newly assembled mitochondrial genomes as well as their corresponding BioProject accession, BioSample accession, and PubMed references (when available). The table additionally includes the accession numbers of the already existing mitochondrial genomes from the RefSeq and INSDC databases. For all records, genome size and GC content are provided.
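For readers who wish to recompute per-genome size and GC content from the combined FASTA file, a minimal Python sketch is given below. It is illustrative only: the function name and demo records are hypothetical, and counting every character in the length (including ambiguous bases) is an assumption that may differ from how the values in the table were derived.

```python
import io


def fasta_stats(handle):
    """Return {record_id: (length, gc_fraction)} for a FASTA text stream."""
    stats = {}
    name = None
    length = 0
    gc = 0
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                stats[name] = (length, gc / length if length else 0.0)
            name = line[1:].split()[0]
            length = 0
            gc = 0
        else:
            seq = line.upper()
            length += len(seq)
            gc += seq.count("G") + seq.count("C")
    if name is not None:
        stats[name] = (length, gc / length if length else 0.0)
    return stats


# Demo on an in-memory stream with placeholder records.
demo = io.StringIO(">rec1 demo record\nATGC\nGGCC\n>rec2\nATAT\n")
stats = fasta_stats(demo)
print(stats)
```

For the real file, the same function can be applied to an open file handle, for example `with open("Assembled_fungal_mitochondrial_genomes.fasta") as fh: fasta_stats(fh)`.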

Technical Validation

To validate the newly assembled mitochondrial genomes, we followed three strategies. First, we considered a genome complete only if GetOrganelle16 identified it as circular and complete. Second, after annotation with MFannot17, we set empirical completeness thresholds based on the presence of a minimum number of core protein-coding genes (PCGs) at certain taxonomic ranks. Although fungal mitochondrial genomes have no known universal set of core genes24, a typical fungal mitochondrial genome contains genes from four complexes, i.e., complex I (nad1, nad2, nad3, nad4, nad4L, nad5, and nad6), complex III (cob), complex IV (cox1, cox2, and cox3), and complex V (atp6, atp8, and atp9)6. Based on the RefSeq mitochondrial genome annotations, we applied different completeness thresholds to different taxonomic groups (down to the order level) as follows: 1) the orders Saccharomycetales, Saccharomycodales, and Schizosaccharomycetales of Ascomycota, at least 7 genes; 2) the order Pezizales of Ascomycota, at least 8 genes; 3) the phylum Chytridiomycota (orders Caulochytriales and Cladochytriales), at least 11 genes; 4) the order Lobulomycetales of Chytridiomycota, 12 genes; 5) the orders Acrospermales, Aulographales, Eremomycetales, Jahnulales, Microthyriales, Mytilinidiales, Phaeotrichales, Pleosporales, and Venturiales of Ascomycota, at least 12 genes; 6) the orders Pucciniales, Septobasidiales, and Urocystidales of Basidiomycota, at least 13 genes; and 7) all remaining orders of Fungi, at least 13 genes.
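The order-specific thresholds listed above can be expressed as a lookup table. The Python sketch below is illustrative and not the code used in this study: the names are hypothetical, the Lobulomycetales threshold is read as "at least 12" (an assumption), and gene names are matched exactly as written in the list of 14 PCGs.

```python
CORE_PCGS = frozenset({
    "nad1", "nad2", "nad3", "nad4", "nad4L", "nad5", "nad6",  # complex I
    "cob",                                                    # complex III
    "cox1", "cox2", "cox3",                                   # complex IV
    "atp6", "atp8", "atp9",                                   # complex V
})

# (orders, minimum number of core PCGs), transcribed from the text above.
_THRESHOLD_GROUPS = [
    (("Saccharomycetales", "Saccharomycodales", "Schizosaccharomycetales"), 7),
    (("Pezizales",), 8),
    (("Caulochytriales", "Cladochytriales"), 11),
    (("Lobulomycetales",), 12),  # assumption: interpreted as "at least 12"
    (("Acrospermales", "Aulographales", "Eremomycetales", "Jahnulales",
      "Microthyriales", "Mytilinidiales", "Phaeotrichales", "Pleosporales",
      "Venturiales"), 12),
    (("Pucciniales", "Septobasidiales", "Urocystidales"), 13),
]
MIN_CORE_PCGS = {
    order: minimum for orders, minimum in _THRESHOLD_GROUPS for order in orders
}
DEFAULT_MIN_CORE_PCGS = 13  # all remaining orders of Fungi


def passes_completeness(order, annotated_genes):
    """True if the genome carries enough core PCGs for its taxonomic order."""
    present = CORE_PCGS & {gene.strip() for gene in annotated_genes}
    return len(present) >= MIN_CORE_PCGS.get(order, DEFAULT_MIN_CORE_PCGS)


# A genome lacking all complex I genes passes only under the 7-gene threshold.
seven_genes = ["cob", "cox1", "cox2", "cox3", "atp6", "atp8", "atp9"]
print(passes_completeness("Saccharomycetales", seven_genes))
print(passes_completeness("Pleosporales", seven_genes))
```

Using a set intersection means duplicated annotations and genes outside the core list (for example ribosomal protein genes) do not inflate the count.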

Finally, we used only the genomes that passed the previous filters to build a phylogenetic overview that also included genomes from the NCBI RefSeq and the INSDC databases (Fig. 2). The resulting phylogenetic tree included a total of 4,012 genomes, providing a comprehensive overview of the Kingdom Fungi. The newly assembled genomes fell within the expected clades when compared with the RefSeq and INSDC genomes.

Fig. 2: Validation of the newly assembled fungal mitochondrial genomes in comparison with the RefSeq and INSDC genomes.

(a) Proportion of fungal mitochondrial genomes from different sources. (b) Maximum-likelihood phylogeny of fungal mitochondrial genomes. The analysis includes all complete fungal mitochondrial genomes assembled in this study as well as species representatives from the RefSeq and the INSDC databases. The phylogeny was inferred with FastTreeMP from concatenated mitochondrial protein sequences (cox1, cox2, cox3, atp6, atp8, atp9, cob, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6) under the LG + Γ model. Branch lengths represent the number of substitutions per site. Node support values were estimated using FastTree’s approximate likelihood ratio test (SH-like local support) but are not shown (79% of the nodes have support >70%; for details, please refer to the dataset file “fungal_mitochondrial_genomes_phylogeny.tre”). The tree was rooted with the phylum Rozellomycota (including 2 genomes). The colors of the tree tips indicate the phylum of origin, while the colors of the outer ring indicate the source database. (c) The phylogenetic tree of panel (b) collapsed at the phylum level to provide a comprehensive overview of the Kingdom Fungi, using the latest taxonomic proposal25. The red dots mark the branches with support >70%.