Introduction

The fungal genus Aspergillus (subphylum Pezizomycotina, phylum Ascomycota) comprises some of the most important human opportunistic fungal pathogens. Invasive aspergillosis and chronic pulmonary aspergillosis1 impact ~250,000 and 3 million patients annually on a global scale, respectively2. Invasive aspergillosis mainly afflicts individuals with compromised immunity or other underlying conditions3,4,5,6,7. Mortality due to invasive aspergillosis varies among patient populations; ICU patients, as well as those with lung cancer, generally exhibit ~50% mortality8. In contrast to invasive aspergillosis, keratitis caused by Aspergillus spp. mainly occurs in immunocompetent patients after ocular trauma or contact lens use and can result in visual impairment and even blindness9. Fungal keratitis is estimated to cause over one million cases of blindness annually10.

Molecular barcoding studies from the last decade suggest that the most common infectious agents are Aspergillus fumigatus (4−52% of cases), followed by Aspergillus flavus (13−40%), Aspergillus niger (8−35%), and Aspergillus terreus (0−7%)11,12,13, depending on the disease. Despite belonging to the same genus, these species exhibit high levels of genomic sequence divergence; for example, A. fumigatus and A. flavus are as diverged as human and fish genomes14. Pathogenicity in Aspergillus has evolved independently multiple times, with species following various evolutionary trajectories to enable pathogenicity15. Alongside evolutionary differences, diverse geographic regions exhibit distinct epidemiological patterns in Aspergillus species prevalence. For example, the percentage of invasive aspergillosis cases caused by A. flavus varies by region, with ~10% of cases in the USA and Canada16 but ~40% in India13 attributed to A. flavus. The incidence of keratitis also exhibits regional differences in etiological agents9,17.

Despite several Aspergillus species infecting humans, research to date has focused primarily on A. fumigatus. Unlike A. fumigatus, A. flavus is of both clinical and agricultural interest18. A. flavus is notorious for producing highly carcinogenic mycotoxins known as aflatoxins. Aflatoxins are associated with billion-dollar crop losses annually, and consumption by humans and other animals is associated with cancer, stunted growth, liver failure, and even death19. The production of aflatoxins and an arsenal of other small, bioactive molecules termed secondary metabolites varies among populations of A. flavus, suggesting niche adaptation to specific microenvironments or competition20. In some fungi, including A. fumigatus, other secondary metabolites like gliotoxin can suppress host immune systems, serving as virulence factors15,21. Some strains of A. flavus and A. terreus also produce gliotoxin or its precursors22. While there is some evidence that kojic acid produced by A. flavus increases the toxicity of aflatoxin in some insect species23, to our knowledge, virulence factors that impact human infections have yet to be described for A. flavus. Due to the nature of opportunistic pathogens, traits enabling human infection are likely to benefit the fungus in certain environmental niches, as human infections arise from the environment rather than person-to-person transmission.

Individual isolates of A. flavus exhibit substantial phenotypic variation, termed strain heterogeneity. Environmental (i.e., plant- or soil-associated) isolates can cause disease in animal models of aspergillosis24 and keratitis25, but strains vary in clinically relevant traits such as growth under iron starvation conditions and virulence in animal models of fungal disease24,25. Additionally, A. flavus isolates may produce large or small sclerotia (a hardened mass of compacted mycelium capable of long-term survival under stressful conditions), resulting in L and S morphotypes, respectively; sclerotial morphotypes correlate with aflatoxin production and other cellular processes, including conidiation26.

Although genetic diversity within A. flavus has been studied using microsatellite markers in environmental27,28,29, veterinary30, and clinical31 contexts, these studies focused on a few loci and therefore examined only a small fraction of the total genomic divergence among the isolates. In contrast, a recent large genomic study of A. flavus found high levels of genetic diversity and genetically isolated populations that vary in extent of recombination32. However, this study focused on agricultural isolates from the USA and did not include clinical (i.e., human-associated) isolates, leaving the relationship between environmental and clinical isolates unexplored. Although public databases such as NCBI’s GenBank and Sequence Read Archive contain many A. flavus genome assemblies and whole genome sequencing (WGS) datasets, only a small proportion of these are from clinical isolates33,34. WGS data provide opportunities to study not only fine-scale population structure (population genomics), but also gene presence and absence across all available genomes in a species (pan-genomics). In a pan-genome, the core genome is defined as genes present in nearly all individuals, while genes that are absent from some individuals are called accessory genes. Core genes are strongly conserved and are thought to encode the core metabolism and housekeeping functions of a species35, with accessory genes encoding non-essential functions and possible local niche adaptations. Pan-genomic analyses are widespread in bacteria but have only recently been adopted in fungi36. In A. fumigatus, pan-genomes have been used to identify genetic variants associated with human pathogenicity, recombination rates, and relatedness of clinical and environmental isolates37,38,39.

In this study, we sequenced the genomes of 82 clinical isolates of A. flavus representing a diversity of human infection types. We combined our genomic data with publicly available genomic data to create a dataset of 300 (117 clinical and 183 environmental) genomes. We analyzed the genomes of clinical and environmental isolates to (a) infer the population structure of the species, (b) investigate the relationship between population structure and isolation environment (clinical vs. environmental), (c) define the pan-genome of A. flavus, and (d) identify genetic elements that are associated with clinical isolates.

Results

Five populations of A. flavus were identified, with clinical isolates overrepresented in one population

For genomes from clinical isolates sequenced in this study, short-read sequencing resulted in over 10 million paired-end reads for each of the isolates (Supplemental Data 2). Trimming resulted in 10,326,680 to 53,093,447 paired reads per isolate (Supplemental Data 2).

To explore the population structure of A. flavus, we analyzed 1,941,481 biallelic SNPs from 281 isolates, after clone correction. We found evidence of five populations based on admixture and DAPC analyses (Fig. 1). In addition to the previously described A, B, C, and S-type populations, we discovered a new population, D (Fig. 1; Supplemental Data 3). We calculated admixture coefficients for each isolate (Fig. 1A), revealing higher levels of admixture in populations A, C, and D than in populations B and S-type. As with the admixture analysis, the DAPC also provided evidence of five populations in our dataset (Fig. 1B), based on BIC scores for each cluster. A principal coordinates analysis using Euclidean distance showed considerable overlap between populations A, C, and D, but populations B and S-type were distinct from others along PC1 (Fig. 1C). The S-type population was the smallest (n = 9) and included isolates previously confirmed to produce S-type sclerotia. Population A (n = 114) was the largest, followed by population B (n = 68), population D (n = 54), and population C (n = 35). The reference strain, A. flavus NRRL 3357, was placed in population A. The designated type strain for the species, A. flavus NRRL 1957, was placed in population D.

Fig. 1: Population structure of Aspergillus flavus reveals genetic isolation reflecting five populations, including a new population, D.

Analyses are based on 1,941,481 biallelic single nucleotide variants. A Estimates of individual ancestry, with K = 5, conducted using the software package LEA67, which estimates individual admixture coefficients from a genotypic matrix98. We estimated admixture for K = 2 through 10, with K = 5 providing the best fit for our data according to the cross-entropy criterion68,69. B Discriminant analysis of principal components shows admixture among populations A, C, and D as well as clear separation of populations B and S-type. Dots represent individuals and ellipses indicate group clustering of individuals. Populations are color coded as indicated in the top right. The discriminant analysis eigenvalues are shown on the bottom left, with the darker bars showing eigenvalues retained. C A principal coordinates analysis displaying relative genetic distances of individual isolates, here represented by dots, using Nei’s genetic distance matrix. Axes indicate the two principal coordinates retained and the percentage of variance explained by each coordinate. Populations A, C, and D varied primarily along PC2 rather than PC1; population B showed genetic differentiation from other populations and varied primarily along PC1. Single nucleotide polymorphisms can be inferred from the alignment provided as our Source Data file.

An overwhelming majority of isolates in both the S-type and B populations were from the USA, whereas the other three populations (A, C, and D) each contained isolates from at least five countries representing three or more continents. To examine the impact of geography on the population structure, we tested for isolation by distance40. The Mantel test statistic, Pearson’s product-moment correlation r, ranges from −1 (a perfect negative correlation between the two distance matrices) to 1 (a perfect positive correlation), with 0 indicating no correlation. We saw evidence of isolation by distance in our dataset (Mantel test; r = 0.173, p = 0.001). We also saw evidence of isolation by distance within populations C and D, but not populations A or B (Fig. S1).
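The logic of a Mantel test can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this study: the function name, the toy coordinates, and the number of permutations are assumptions made for the example. It correlates the upper triangles of two distance matrices and obtains a one-sided p-value by permuting the rows and columns of one matrix together.

```python
import numpy as np

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """One-sided permutation Mantel test (hypothetical helper, illustration only)."""
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    n = a.shape[0]
    upper = np.triu_indices(n, k=1)  # count each pair of isolates once

    def pearson(x, y):
        return float(np.corrcoef(x, y)[0, 1])

    r_obs = pearson(a[upper], b[upper])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        order = rng.permutation(n)
        # Relabel isolates: rows and columns are permuted together
        permuted = a[np.ix_(order, order)]
        if pearson(permuted[upper], b[upper]) >= r_obs:
            hits += 1
    # The observed arrangement counts as one of the permutations
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value

# Toy data (not from the study): six sites along a transect
sites = np.array([0.0, 1.0, 2.5, 4.0, 7.0, 11.0])
geographic = np.abs(sites[:, None] - sites[None, :])
noise = np.random.default_rng(1).uniform(0.0, 1.0, size=geographic.shape)
noise = (noise + noise.T) / 2.0
np.fill_diagonal(noise, 0.0)
genetic = 0.01 * geographic + 0.002 * noise

r_toy, p_toy = mantel(genetic, geographic)
```

In this sketch the smallest attainable p-value is 1 / (permutations + 1), so the resolution of the reported significance depends on how many permutations are run.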

Population B was diverged from other populations, as Nei’s genetic distance was high between population B and all other populations (D > 0.111 for every comparison involving population B). Nei’s genetic distance was lowest between populations A and D (D = 0.043), indicating genetic similarities between the populations. Genetic differentiation across all five populations was consistent with the results of an AMOVA (Analysis of Molecular Variance) (global phi-statistic = 0.765; p = 0). Populations A, C, and D contained over 95% of the clinical isolates, whereas the S-type population contained only a single clinical isolate and population B contained only three clinical isolates. Population D had the highest proportion of clinical isolates, with >85% of isolates in the population being patient-derived. In comparison, fewer than 40% of isolates assigned to population A, the largest population, were human-associated. In population C, 51% of isolates were clinical. Clinical isolates originated from seven different countries: India, Japan, Netherlands, Germany, France, Spain, and the USA (Supplemental Data 1).

To examine the phylogenetic relationships among the isolates, we next constructed a maximum likelihood phylogeny using the full SNP dataset (1,941,481 SNPs), with A. minisclerotigenes (section Flavi) serving as an outgroup (Fig. 2). Clinical isolates were in all five clades, with each clade corresponding to a genetic population. All isolates with the “S” sclerotial morphotype were placed in a single monophyletic group, along with other isolates we infer to also be S-type (Fig. 2). Phylogenetic placement of isolates mostly supported the genetic populations (Fig. 2). Two clinical isolates were assigned to population A in the DAPC but exhibited considerable admixture with populations A and D (Fig. 1A). The maximum likelihood phylogeny recapitulated the DAPC results, with both isolates branching within the population A clade. However, in a neighbor-net network, one isolate was placed within population D rather than population A (Fig. S2). Unlike phylogenetic trees, neighbor-net networks capture additional nuance in relationships by including recombination. Based on the admixture analysis and neighbor-net network, these two clinical isolates, one from Germany, and one from India, possibly represent hybridizations between the A and D populations.

Fig. 2: Maximum likelihood phylogeny supports the existence of five populations and non-random distribution of clinical isolates across populations.

Filled in circles along the outer track indicate clinical isolates; empty circles indicate environmental isolates. Branch colors correspond with population assignment based on the discriminant analysis of principal components (DAPC; Fig. 1): the S-type population is indicated in orange; population A in yellow; population B in blue, population C in pink, and population D in purple. The outgroup, Aspergillus minisclerotigenes, is represented in black. Apart from the outgroup, each tip represents an A. flavus isolate and branch lengths denote sequence divergence. Ultrafast bootstrap values are indicated below nodes. The phylogeny was constructed using 2,018,259 SNPs from a dataset of 300 A. flavus isolates and the outgroup A. minisclerotigenes. Source data are provided as a Source Data file.

To test whether clinical isolates were non-randomly distributed across the A. flavus phylogeny, we calculated Fritz and Purvis’s D statistic41; a value of 0 indicates a clumping of the observed trait (in this case, human pathogenicity) as expected under the Brownian motion model, whereas a value of 1 indicates a random distribution across the phylogeny. The test assesses whether D departs significantly from 0 (Brownian motion model of evolution) and from 1 (random distribution). We calculated D for the dataset (n = 300 isolates) and observed a lack of random distribution (D = 0.245), which was significantly different from a value of 1 expected under random distribution (p < 0.0001), but not significantly different from a value of 0 expected under Brownian motion (p = 0.067). These results suggest that patient-derived isolates are not randomly distributed across the phylogeny; rather, they appear to be found predominantly in certain populations (e.g., populations A, C, and D) and are largely absent from others (e.g., population B).
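The reference points of 0 and 1 follow from how D is scaled. As a minimal sketch, assuming the three inputs have already been obtained (the observed sum of sister-clade differences for the binary trait, and the mean of that sum under simulated Brownian and random trait distributions), the scaling step is shown below. The function name and the numbers are hypothetical and are not values from this study.

```python
def fritz_purvis_d(observed_sum, random_mean, brownian_mean):
    """Scale an observed sum of sister-clade differences to the D statistic.

    D = (observed - brownian_mean) / (random_mean - brownian_mean), so an
    observed sum equal to the Brownian expectation gives 0 and an observed
    sum equal to the random expectation gives 1.
    """
    return (observed_sum - brownian_mean) / (random_mean - brownian_mean)

# Illustrative numbers only (not from the study)
d_example = fritz_purvis_d(observed_sum=30.0, random_mean=60.0, brownian_mean=20.0)
```

Values below 0 or above 1 are possible under this scaling and indicate a trait that is more clumped than Brownian motion predicts or more dispersed than random, respectively.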

Although we observed enrichment of clinical isolates in some clades, as supported by our D statistic, we recognize that our sampling of environmental and clinical isolates was uneven due to hospital culture collection location and availability of public data. We did, however, include both clinical and environmental data from multiple areas of the globe. Despite the uneven sampling inherent in studies utilizing public data, populations with a large proportion of clinical isolates (A, C, and D) include isolates from multiple countries across several continents (Fig. 3; Supplemental Data 1). Furthermore, even though the majority of clinical isolates are from Europe, we still observe a higher proportion of clinical isolates in population D (54%) than in A or C (25% and 8%, respectively) following exclusion of European isolates. Population D includes isolates from seven countries, including clinical isolates from Japan, India, and the USA. In contrast, population B isolates are almost entirely environmental isolates from the USA, although the isolates represent diverse regions within the country, and we did not see strong evidence for isolation by distance in this population, as discussed above. We also did not see evidence of isolation by distance in population A, our largest and most geographically distributed population. Notably, no Kenyan or Ethiopian isolates, all of which were environmental, were placed in population D. With the available data, we conclude that populations enriched in clinical isolates are more globally widespread than populations with fewer clinical isolates (S-type and B populations), although this finding warrants further testing through sequencing of isolates from additional geographic regions.

Fig. 3: The pan-genome of Aspergillus flavus is closed and contains 17,676 orthogroups.

A Rarefaction curve of the number of orthogroups added with each additional genome, excluding singletons. Data are presented as the mean of 100 permutations with random ordering of genomes added, with standard deviation indicated. B Histogram of orthogroup frequency determined by the number of genomes in which each orthogroup is present. The core genome contains 10,161 orthogroups. The accessory genome of A. flavus contains 7515 orthogroups. C Heatmap of normalized abundance of gene ontology (GO) annotations with significant differences in abundance among populations. Significance was determined by one-way ANOVA (Bonferroni-corrected p < 0.05). The bar chart shows the mean number of genes containing each GO term across all genomes. Source data in the form of predicted proteomes are provided in the associated Figshare repository and Supplemental Data 6.

A. flavus isolates exhibit heterogeneity in gene content and genome size

Strain heterogeneity in gene content can impact diverse traits, including virulence and secondary metabolite production, so we also examined our dataset using methods independent of the reference genome. Genomes from 82 clinical isolates sequenced in this study contained 17 to 4235 scaffolds (Supplemental Data 2). We also assembled genomes for an additional 16 clinical and 12 environmental isolates from publicly available sequencing data, resulting in genomes containing 700 to 3615 scaffolds (Supplemental Data 2). Publicly available genome assemblies of environmental A. flavus isolates contained 8 to 1821 scaffolds32,33,34,42,43,44,45,46. Genomes with low completeness based on BUSCO analysis (<95%) of the assemblies were excluded from the pan-genome analysis (Supplemental Data 4).

We examined variability in genome size among populations, which could indicate genetic expansions or streamlining. The mean genome size of population D was larger than those of both populations A and B (Tukey’s multiple comparisons test, adjusted p = 0.0175), with all other pairwise comparisons of population means being nonsignificant (Fig. S3).

We annotated all genomes using Funannotate47 to obtain consistent annotations for comparison. To ensure our annotation pipeline resulted in high-quality proteomes, we compared the Funannotate predicted proteome of NRRL 3357 to the RefSeq reference annotation and an additional transcriptome-based annotation of the same strain48; the two published proteomes contained 11 and 66 orthogroups that were not present in the Funannotate prediction, respectively, accounting for a tiny fraction of the gene content. Minor differences in gene prediction are expected due to gene fragmentation contributing to orthogroup variation. Overall, we are confident that the annotations predicted by the Funannotate pipeline are consistently high quality, enabling comparisons across isolates. The number of protein-coding genes predicted by Funannotate ranged from 11,461 to 15,501 across isolates (Supplemental Data 5).

The A. flavus pan-genome is closed and contains 17,676 orthogroups

To quantify the degree of gene presence-absence variation among isolates, we constructed a pan-genome of A. flavus. We used OrthoFinder to cluster predicted proteins into orthogroups, which were then compared across isolates and populations. Our pan-genome of A. flavus is closed (Heaps’ law, alpha = 1.000023), with each genome after the 200th adding fewer orthogroups (Fig. 3A). We identified a total of 17,676 orthogroups. Of these, 10,161 (57.5%) orthogroups were in at least 95% of isolates; we consider these orthogroups to be the core genome. Within the core genome, 3375 orthogroups were single-copy and present in all isolates. The pan-genome of A. flavus exhibits a U-shaped distribution, as expected (Fig. 3B). The accessory pan-genome of A. flavus consists of 7515 orthogroups, of which 3387 (19.2% of all orthogroups) were in <5% of isolates; we consider these to be the “cloud” genome.
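The core, accessory, and cloud categories described above can be derived directly from an orthogroup presence-absence matrix. The sketch below is illustrative only: the function name and the toy matrix are assumptions, and it simply applies the 95% and 5% thresholds stated in the text rather than reproducing the OrthoFinder-based workflow.

```python
import numpy as np

def partition_pangenome(presence, core_min=0.95, cloud_max=0.05):
    """Count core, accessory, and cloud orthogroups (hypothetical helper).

    presence: 2-D boolean array, rows = orthogroups, columns = genomes.
    """
    presence = np.asarray(presence, dtype=bool)
    freq = presence.mean(axis=1)          # fraction of genomes with each orthogroup
    observed = freq > 0                   # ignore orthogroups found in no genome
    core = freq >= core_min
    accessory = observed & ~core          # everything that is not core
    cloud = observed & (freq < cloud_max) # rare subset of the accessory genome
    return {
        "core": int(core.sum()),
        "accessory": int(accessory.sum()),
        "cloud": int(cloud.sum()),
    }

# Toy matrix (not from the study): 4 orthogroups across 40 genomes
toy = np.zeros((4, 40), dtype=bool)
toy[0, :] = True     # present in all genomes
toy[1, :39] = True   # present in 39 of 40 genomes
toy[2, :20] = True   # present in half of the genomes
toy[3, :1] = True    # present in a single genome
counts = partition_pangenome(toy)
```

Because the cloud genome is defined here as a subset of the accessory genome, the core and accessory counts sum to the total number of observed orthogroups, as they do for the 10,161 core and 7515 accessory orthogroups reported above.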

To explore which functional annotations were over- or underrepresented within populations, we examined presence or absence and abundance of InterPro annotations and gene ontology (GO) terms. Populations A, C, and D, which are enriched in clinical isolates, shared much of their gene content and did not show any population-specific patterning of orthogroup presence or absence in a PCoA, but population B did (Fig. S4). We infer that gene content among population B isolates is more conserved and distinct from that of other populations, likely due to low diversity within the population. In addition, we examined the abundance of GO terms and InterPro annotations and compared the mean among populations, excluding the S-type population due to its small size. Populations had substantial differences in annotations and several GO terms were differentially abundant among populations (Fig. 3C). Given the over-representation of clinical isolates in population D, we focused on interpreting differences in functional annotations between population D and all other populations. Isolates in this population had a higher abundance of genes involved in many cellular processes, including certain types of hydrolase activity, nucleoside metabolic and carbohydrate metabolic processes, DNA-binding transcription factor activity, regulation of transcription, lipid metabolic process, NAD binding, catalytic activity, and acyltransferase activity (Supplemental Data 6; Fig. 4 and S5). Genes annotated with ferric iron binding functionalities were found in lower abundance in population D compared to other populations. Population B, the population with the fewest clinical isolates, was depleted in genes related to zinc ion binding (Fig. 4).

Fig. 4: Gene ontology (GO) terms more prevalent in population D than other populations include zinc ion binding and hydrolase activity, among others.

Boxplots indicate the number of genes annotated with each GO term per isolate, by population assignment. The Y axis scale is adjusted for each GO term to better show differences among populations. In the box-and-whisker plots, the horizontal line in the box indicates the 50th percentile and the box extends from the 25th to the 75th percentile. The whiskers encompass the lowest and highest values within 1.5x the interquartile range. Individual dots represent statistical outliers. Statistical significance was determined using a one-way analysis of variance followed by post hoc testing using Tukey’s honest significance test (one-sided). The letters denote significance as a compact letter display, in which groups that are not significantly different from each other share the same letter; P < 0.05. After removing clones, low quality assemblies, and the S-type isolates, statistics were calculated for 232 isolates: population A (n = 81), population B (n = 69), population C (n = 29), population D (n = 53). Source data are provided in Supplemental Data 6.

Genes in a putative non-ribosomal peptide synthesis (NRPS) BGC with an unknown product were absent in >90% of isolates within the S-type and B populations. The backbone gene for the NRPS cluster was present in all isolates in all populations, but additional biosynthetic or transporter genes within the BGC were absent from isolates within the S-type and B populations (G4B84_009247, G4B84_009246, G4B84_009245, and G4B84_009244 in A. flavus NRRL 3357, where all genes in the BGC are present). The BGC was previously identified as “BGC_44” on chromosome 6 but was not explored in depth20. GO terms associated with multiple genes within the BGC (OG0011498 [GO:0000981; GO:0006355; GO:0008270]; OG0011868 [GO:0003824, GO:0006807]; OG0011918 [GO:0003824]) were also differentially abundant among populations (Supplemental Data 6).

Additionally, orthogroups related to aflatoxin biosynthesis (GO:0045122) differed in abundance among populations (Supplemental Data 6). Predictions from antiSMASH indicate the aflatoxin BGC was present in 68.6% of isolates (n = 166/242). In line with previous research20, the BGC was present in almost all isolates in populations A, C, and S-type, but absent or degraded in many isolates within population B (Fig. S6). Interestingly, we found that the aflatoxin BGC was also absent or degraded in the newly defined population D, which has the most clinical isolates of any population.

Although virulence factors for human pathogenicity have not been identified in A. flavus, we examined 20 genes known to increase virulence in plant pathogenicity assays. All 20 genes were present in the isolates, but nucleotide percent similarity to the reference varied. For example, the stress-responsive transcription factor skn7 showed substantial variation among isolates; skn7 is part of the phosphorelay signal transduction system49 associated with GO:0000160, which we saw in lower abundance in population A compared to our other populations (Supplemental Data 6).

Three orthogroups with high variability in number of gene family members were correlated with human-association

We used phylogenetic generalized least squares (PGLS) models to examine the relationships between multiple variables including isolate source (clinical or environmental), genome size, number of tRNAs, number of predicted genes, and number of predicted biosynthetic clusters. Using these phylogenetically informed linear regression models, we observed a correlation between genome size and number of predicted genes (p = 0; adjusted R² = 0.4891). We also examined the 10 orthogroups with the largest variability of gene family members from each isolate. We found that only three were significantly associated with isolate source (OG0000011, OG0000060, and OG0000270), all with low adjusted R² values (Supplemental Data 6); BLASTp results linked OG0000060 and OG0000270 with hypothetical proteins and implicated OG0000011 in natural product biosynthesis. OG0000011 was annotated with gene ontology terms “GO:0003824 catalytic activity” and “GO:0016746 transferase activity, transferring acyl groups,” which were both significantly differentially abundant among populations (Supplemental Data 7). None of the PGLS models showed significant correlation between orthogroups and DAPC population assignment (Supplemental Data 3).
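For readers unfamiliar with PGLS, the core idea is ordinary regression in which the residuals are allowed to covary according to shared evolutionary history. The sketch below shows only the generalized least squares estimator and makes several assumptions not taken from this study: the function name, a Brownian motion covariance matrix built from shared branch lengths, and a toy four-tip tree. It is not the software or model configuration used for the analyses reported here.

```python
import numpy as np

def pgls_fit(y, X, V):
    """Generalized least squares coefficients (hypothetical helper).

    y: response for each tip, X: design matrix (first column of ones),
    V: covariance among tips implied by the phylogeny.
    Solves beta = (X' V^-1 X)^-1 X' V^-1 y without forming V^-1 explicitly.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)

# Toy tree ((t1,t2),(t3,t4)) with root-to-tip height 1.0; each pair of
# sister tips shares a branch of length 0.5 (assumed values)
V_toy = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
x_toy = np.array([1.0, 2.0, 4.0, 7.0])
X_toy = np.column_stack([np.ones_like(x_toy), x_toy])
y_toy = 2.0 + 3.0 * x_toy  # noise-free response, so the fit should recover 2 and 3
beta_toy = pgls_fit(y_toy, X_toy, V_toy)
```

When V is the identity matrix, this estimator reduces to ordinary least squares, which is one way to see what the phylogenetic correction adds.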

Discussion

In this study, we examined the population structure, phylogeny, and pan-genome of A. flavus using genomic data from 300 isolates, including 82 clinical isolates sequenced by us. By combining genomic data for both clinical and environmental isolates of this pathogen, our study provides a rich dataset for future study and reveals fine-scale differences in pathogenicity within A. flavus.

Previous research using genomic data of environmental isolates from the USA described three populations of L-type (isolates producing large sclerotia) A. flavus isolates: A, B, and C32. With the inclusion of additional isolates, notably clinical isolates, our study identifies another distinct population, here termed population D, which contains the majority of clinical isolates. Clinical isolates were present in all L-type populations, but at different abundances—populations A, C, and D were enriched in clinical isolates, with population D containing the highest proportion. Isolates with the small sclerotial morphotype (S-type) grouped together in all analyses, as seen previously32, and only a single clinical isolate was placed in this group. We did not expect many clinical isolates to be part of the S-type population, as S-type isolates produce conidia at a far lower rate than L-type isolates26—interaction with these airborne asexual spores is how most patients are infected with Aspergillus species, so isolates producing fewer conidia are less likely to have spores interact with a human host.

Populations contained a combination of isolates from several different infections or microenvironments including soil, infections of different plant hosts (e.g., peanuts, corn, almonds), and different types of human infections (keratitis, aspergillosis, otomycosis, etc.), indicating a lack of specialization in populations. Previous work on environmental isolates showed no evidence of host specialization in A. flavus, with a single isolate able to infect both plant and animal hosts50, which is consistent with our observation of the lack of clustering of isolates from the same microenvironment.

Interestingly, clinical isolates were concentrated in populations A, C, and D, with population D containing the majority of clinical isolates and few environmental ones. Cryptococcus neoformans, another opportunistic fungal pathogen of humans, shows similar enrichments of clinical isolates in some clades compared to others51, whereas the most common human pathogen in the genus Aspergillus, A. fumigatus, does not37. In A. flavus populations A, C, and D, isolates did not cluster by country of origin, but we did observe a positive correlation between genetic and geographic distances in populations C and D, indicating that genetically divergent isolates were likely to be geographically distant. Geographic sampling within our dataset was not balanced, due to our use of public data and heavy sampling of clinical isolates within Europe. No environmental isolates from Europe had public data available, and none were available from large fungal culture collections, such as the NRRL or the Westerdijk Institute. The addition of environmental isolates from Europe in future studies will further test the validity of the inferred link between pathogenicity and population structure observed in our analyses.

We observed several important differences between A. flavus and the major human pathogen A. fumigatus. Our finding that some populations are highly enriched for clinical isolates in A. flavus contrasts with observations in A. fumigatus, wherein clinical isolates are more evenly distributed across all clades37, highlighting the importance of studying A. flavus as a pathogen rather than assuming that pathogenicity in the two species evolved similarly. Ecological differences among species contribute to the various clinical presentations and prevalence of Aspergillus species causing aspergillosis, such as the ability to form biofilms, or the size of conidia52. Likewise, genetic differences among species may explain some of the variance and prevalence of A. flavus relative to A. fumigatus. A. flavus has a larger pan-genome than A. fumigatus, likely due to the difference in overall genome size between the species, with accessory genes composing a similar percentage of the pan-genome. Several pan-genomic studies have focused on A. fumigatus, with the core genome ranging from 55% to 69% of the pan-genome37,38,39, compared to our finding of 57% in A. flavus. A recent review stated that A. flavus clinical isolates were more similar to one another and exhibited lower diversity than clinical isolates of A. fumigatus, which were more genetically diverse53. A. fumigatus has many more genomes available from clinical isolates than A. flavus and the isolates are not associated with population structure as in A. flavus. However, our observation that clinical isolates are constrained to populations A, C, and D, which share genetic similarities and overlap in a principal coordinates analysis, supports a level of similarity among A. flavus clinical isolates despite deep phylogenetic divergences. We advocate for additional sequencing of clinical isolates, particularly from South America and Africa, as no genomes of clinical isolates are currently publicly available from these regions. Nevertheless, our analyses suggest that the core genome of A. flavus will remain similar even with the addition of new data; among the pan-genomic studies of A. fumigatus the core genome was consistent whereas the accessory genome varied based on input data37,38,39.

Within the pan-genome of A. flavus, we observed several differences among populations. Several GO terms were enriched in population D, often in biological processes like carbohydrate, nucleoside, and lipid metabolism. Molecular functions enriched in population D included hydrolyzing O-glycosyl compounds, a function of exo-polygalacturonases, which are involved in the degradation of plant cell wall polysaccharides54. None of the GO terms were directly implicated in functions typically related to pathogenicity, but several GO terms associated with metal acquisition were differentially abundant among populations. For example, zinc ion binding was depleted in population B. Zinc is an essential micronutrient for many fungal processes, and in A. fumigatus, deletion of a zinc acquisition factor attenuated virulence in a mouse model of aspergillosis55; however, we do not yet know whether the lower abundance of genes annotated to involve zinc ion binding would correlate with lower virulence in population B of A. flavus.

In other fungal infections of humans, secondary metabolites have been implicated in virulence56, most notably the role of gliotoxin in A. fumigatus infections. However, no secondary metabolites have been associated with A. flavus human infections. The most famous secondary metabolite produced by A. flavus is aflatoxin, which is not thought to be important for human infections as the optimum temperature for production of aflatoxins is below 37 °C, with transcription of the BGC dropping at higher temperatures57. The predicted BGC for aflatoxin follows previously reported population-specific patterns20, with presence or absence of the aflatoxin genes explained by clade and population.

Although both clinical and environmental isolates within populations A and C contained the aflatoxin BGC, isolates in population D often lacked genes related to aflatoxin biosynthesis. Population B, which included almost entirely environmental isolates, also lacked aflatoxin biosynthesis genes. Hospitals do not measure aflatoxin production for clinical isolates, leading to a paucity of production data for clinical isolates. However, it appears that although some clinical isolates in populations A and C may have the potential to produce aflatoxin, many clinical isolates in population D lack the necessary genes, and we expect them to be non-aflatoxigenic. The absence of the aflatoxin BGC within many clinical isolates of A. flavus, including almost all within population D, reinforces the apparent lack of association between aflatoxin and virulence in the context of human infections. Other predicted biosynthetic genes and gene clusters, such as BGC_4420, which had accessory biosynthetic genes more prevalent in population D than in other populations, have not been connected to metabolites and therefore their potential role in infection remains unknown.

In summary, we present evidence that clinical isolates of A. flavus share genetic similarities and are concentrated in certain populations rather than distributed across the phylogeny, particularly in an apparently non-aflatoxigenic, newly defined clade, which we have named population D. Clinical isolates from many countries and infection types are present in population D. We acknowledge that sampling was uneven and did not cover the full distribution of A. flavus, and advocate for additional sampling from regions underrepresented in this dataset (e.g., environmental isolates from Europe and clinical isolates from South America and Africa). Additionally, accessory genes and the aflatoxin BGC differ between populations, possibly providing future opportunities for distinct agricultural and clinical treatments. Although we did not discover a single genetic element that could explain the difference between clinical and environmental isolates of A. flavus, we did discover a new clade of A. flavus that appears to be enriched in clinical isolates, with distinct genetic features. This A. flavus genomic dataset and pan-genome provide a valuable tool for understanding why isolates from some populations of A. flavus appear to be more commonly responsible for human infections than other populations.

Methods

Retrieval of publicly available data

We retrieved data for 180 A. flavus isolates with paired-end Illumina whole genome sequencing data available in the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA) in July 2021, including data from 25 clinical isolates. In October 2024, we also retrieved paired-end whole genome sequencing data for 37 additional isolates deposited in SRA since the last search, composed of 10 clinical and 27 environmental isolates (Supplemental Data 1). A. flavus NRRL 3357, which has a chromosome-level genome assembly, was used as a reference. The publicly available data were from 10 countries: China (1), Ethiopia (7), India (7), Israel (1), Japan (21), Kenya (16), Netherlands (9), Pakistan (12), the USA (141), and Vietnam (1). Country of origin was unavailable for four isolates. Isolates represent diverse sources including patient-derived, soil, seed, and plant-associated microenvironments (Supplemental Data 1).

Collection of A. flavus clinical isolates and genome sequencing

We also sequenced 82 patient-derived isolates for this study, from 4 different countries: France (15 from the culture collection of the National Reference Center for Invasive Mycoses and Antifungals [CNRMA] at the Institut Pasteur), Germany (48 from the National Reference Center for Invasive Fungal Disease [NRZMyk]), Spain (7 from the National Centre for Microbiology [CNM] culture collection), and the USA (12 from the University of Texas M. D. Anderson Cancer Center). Isolates were from patients diagnosed with keratitis, aspergillosis, and otomycosis and were obtained through a variety of methods (Supplemental Data 1). Culture and DNA extraction methods varied and are available in the Supporting Information.

DNA libraries were prepared using the Nextera DNA Library Prep Kit (Illumina, San Diego, CA, USA), according to the manufacturer’s guidelines. Sequencing for the isolates from Germany, France, and the USA (75 total) was performed at Vanderbilt University’s sequencing facility, VANTAGE, using the Illumina NovaSeq 6000 instrument, following the manufacturer’s protocols. Sequencing for the seven Spanish CNM isolates was performed using the Illumina MiSeq system, following the manufacturer’s protocols. All sequencing resulted in 150 bp paired-end reads. All isolates included in the study were confirmed to be A. flavus using whole genome sequencing data, in addition to preliminary classification based on morphology or marker genes.

Read mapping

By combining the publicly available data from 217 isolates with clinical isolates sequenced in this study, we compiled a dataset of 300 A. flavus isolates. Raw reads for all isolates were trimmed using Trimmomatic v0.3958 for paired-end data. Trimmed reads were mapped to the NRRL 3357 reference (GCA_009017415.1)59 using Bowtie2 v2.3.4.1 with default parameters60. We used SAMtools v1.661 to convert the resulting data files to BAM format and sort the BAM files. The AddOrReplaceReadGroups option in Picard tools v2.17.10 (https://broadinstitute.github.io/picard/) was used to append read group labels to BAM files. The Genome Analysis Tool Kit v3.8 (GATK) RealignerTargetCreator and IndelRealigner options were used to produce realigned BAM files62 and duplicates were removed using the MarkDuplicates option in Picard. We called variants for each genome using the GATK HaplotypeCaller option with -ploidy 1 for haploid organisms. GVCF files were combined using the CombineGVCFs option and the combined file was genotyped using the GenotypeGVCFs option. Because the called variants include single nucleotide polymorphisms (SNPs), insertions, and deletions, only SNPs were selected and retained. SNPs were filtered using the VariantFiltration option, with --filter-expression parameters “QD < 2.0”, “QUAL < 30”, “MQ < 40.0”, “MQRankSum < −12.5”, “SOR > 3.0”, “FS > 60.0”, and “ReadPosRankSum < −8.0”; other parameters were set as --cluster 8 and --window 10, according to the GATK best practices workflow (https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4).
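The hard-filtering thresholds listed above can be read as a simple predicate over each SNP's annotations. The following is a minimal Python sketch of that logic, not the GATK implementation: the function name `failed_filters` and the dictionary-based variant record are hypothetical, and annotations absent from a record are assumed here to skip the corresponding filter.

```python
import operator

# Thresholds copied from the filter expressions in the text.
# A SNP is flagged if ANY of these comparisons is true.
HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("QUAL", operator.lt, 30.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("SOR", operator.gt, 3.0),
    ("FS", operator.gt, 60.0),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(annotations):
    """Return the names of the hard filters that a variant fails.

    `annotations` is a hypothetical dict of annotation name -> value.
    Missing annotations are skipped (an assumption of this sketch).
    """
    failed = []
    for name, compare, threshold in HARD_FILTERS:
        value = annotations.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


# Illustrative records with made-up annotation values.
high_quality = {"QD": 25.1, "QUAL": 812.0, "MQ": 60.0, "SOR": 0.9, "FS": 1.2}
low_quality = {"QD": 1.4, "QUAL": 45.0, "MQ": 38.0, "SOR": 3.5, "FS": 10.0}

print(failed_filters(high_quality))  # passes every filter
print(failed_filters(low_quality))   # fails QD, MQ, and SOR
```

Only SNPs for which the returned list is empty would be carried forward in this sketch.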

Population genomics

Biallelic SNPs that passed hard filters were retained for further analysis for 300 isolates. Biallelic loci are loci that, at the population level, have only two alleles: the reference and an alternative. Clonal isolates were identified through genetic distance calculations in poppr v2.8.363 and were excluded from the dataset, leaving 281 isolates. We conducted a principal components analysis in R v4.3.1 using adegenet v2.1.1064. Adegenet was also used for the discriminant analysis of principal components (DAPC), a multivariate method to determine the optimal number of genetic clusters for a given dataset65. The Bayesian Information Criterion (BIC) score was used to evaluate a range of possible numbers of genetic clusters from 1−15. The optimal number of clusters was determined by graphing the BIC score for each possible number of clusters. Calculations of missing data per population and sample, minor allele frequency, and Nei’s genetic distance were conducted using the R package SambaR v1.0966. The R package LEA v3.10.267 was used to estimate ancestry coefficients68, from K = 2 to K = 15. Optimal K (number of populations) was determined using the entropy coefficient method69.
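The idea of removing clonal isolates by genetic distance can be illustrated with a short Python sketch. This is not the poppr procedure: the distance measure (proportion of differing SNP sites), the single-linkage grouping, the `threshold` parameter, and the toy genotypes are all assumptions chosen for illustration, since the distance cutoff is not stated in the text.

```python
from itertools import combinations


def snp_distance(a, b, missing="N"):
    """Proportion of sites that differ between two haploid genotype strings,
    ignoring sites that are missing in either isolate."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue
        compared += 1
        if x != y:
            differing += 1
    # With no comparable sites, report an infinite distance so that the
    # pair is never treated as clonal.
    return differing / compared if compared else float("inf")


def collapse_clones(genotypes, threshold):
    """Group isolates whose pairwise distance is <= threshold (single linkage)
    and return one representative per group (the alphabetically first name)."""
    names = sorted(genotypes)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if snp_distance(genotypes[a], genotypes[b]) <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # The smaller name always becomes the root of the merged group.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return sorted({find(name) for name in names})


# Toy genotypes (hypothetical): iso1 and iso2 are identical.
toy = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGT",
    "iso3": "ACGTACGA",
    "iso4": "TTGTNCAA",
}
print(collapse_clones(toy, threshold=0.0))
```

Raising `threshold` merges near-identical isolates as well, which is why the choice of cutoff should be justified from the distribution of pairwise distances in the real data.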

To determine if molecular variation among populations was larger than variation within specific populations of A. flavus, we used Nei’s genetic distance70 to implement an AMOVA (Analysis of Molecular Variance) in R v4.3.1 using the R package pegas v1.371 with 1000 permutations. A two-way ANOVA implemented in PRISM 10 (Graphpad) was used to establish the relationship between isolate sources (soil, plant-associated, and human-associated) and genetic populations identified by DAPC.

In population genomics, correlations between physical and genetic distance of isolates can impact population structure. As such, we tested for isolation by distance40 using a Mantel test72 implemented in the R package dartR v2.9.773. Geographic location for clinical isolates was conservatively estimated by using the coordinates of each isolate’s culture collection. Nei’s genetic distance70 was used for the genetic distance matrix. For the geographic distance matrix, latitude and longitude were either (1) obtained from previously published data or public metadata from NCBI or (2) estimated from listed hospital location for patient-derived isolates or city of sampling for environmental isolates. Isolates without latitude or longitude locations were considered missing data and coded as “NA” in the table.
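The isolation-by-distance test described above compares a genetic distance matrix with a geographic distance matrix. The sketch below shows the general shape of a permutation-based Mantel test in Python; it is an illustration under stated assumptions rather than the dartR implementation. The function names, the haversine formula for great-circle distance, the Earth radius of 6371 km, the one-sided p-value, and the toy coordinates are all choices made for this example.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, geographic, permutations=999, seed=1):
    """Return (r, p): the matrix correlation and a one-sided permutation p-value
    for a positive association between the two distance matrices."""
    identity = list(range(len(genetic)))
    geo = upper_triangle(geographic, identity)
    observed = pearson(upper_triangle(genetic, identity), geo)
    rng = random.Random(seed)
    at_least_as_large = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute isolates (rows and columns together)
        if pearson(upper_triangle(genetic, order), geo) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


# Toy data (hypothetical): genetic distance proportional to geographic distance.
coords = [(0, 0), (0, 10), (10, 0), (20, 20), (-30, 40)]
geo_matrix = [[haversine_km(*p, *q) for q in coords] for p in coords]
gen_matrix = [[0.001 * d for d in row] for row in geo_matrix]
r, p = mantel(gen_matrix, geo_matrix)
print(round(r, 3), p)
```

Isolates coded as “NA” for location would need to be dropped from both matrices before calling `mantel` in this sketch.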

Phylogenomics

Using the SNPs called and filtered as described above, we reconstructed a phylogeny of all 300 A. flavus isolates, with the close relative Aspergillus minisclerotigenes (SRA: SRR12001146) as an outgroup. Only loci present in at least eight isolates were included. We used Lewis ascertainment bias correction to include only variable (non-constant) characters from our SNP data74. The phylogeny was constructed in IQ-Tree v2.2.2.675 using the ModelFinder Plus76 (-m MFP) option and 1000 ultrafast bootstrap replicates. The F81 + F model was chosen by IQ-Tree as the best-fit model according to BIC. The consensus tree was used for visualization. We used iTOL v6 to visualize and annotate the phylogeny77. We also constructed a phylogenetic network using SplitsTreeCE v6.3.078 with default parameters to examine the relationships among isolates in a neighbor-net network.
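The two site-level criteria mentioned above (a locus must be called in at least eight isolates, and only variable characters are kept for the ascertainment bias correction) can be sketched as a column filter over a SNP alignment. This Python example is illustrative only: the function name `filter_alignment`, the use of "N" for missing calls, and the toy alignment are assumptions, and the actual filtering in the study was performed with the tools named in the text.

```python
def filter_alignment(alignment, min_called=8, missing="N"):
    """Keep alignment columns that are called in at least `min_called`
    sequences and that contain more than one distinct called base.

    `alignment` maps sequence name -> aligned string (all the same length).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    kept_columns = []
    for i in range(length):
        called = [alignment[name][i] for name in names if alignment[name][i] != missing]
        if len(called) >= min_called and len(set(called)) > 1:
            kept_columns.append(i)
    return {name: "".join(alignment[name][i] for i in kept_columns) for name in names}


# Toy alignment (hypothetical) with a small threshold so the effect is visible.
toy_alignment = {
    "iso_a": "AAGTNN",
    "iso_b": "AAGCNG",
    "iso_c": "ACGTNT",
}
print(filter_alignment(toy_alignment, min_called=3))
```

In the toy alignment, constant columns and the all-missing column are dropped, and the last column is dropped as well because only two isolates have a call there.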

To test whether clinical isolates were randomly distributed across the phylogeny or were more likely to be clustered (that is, whether clinical isolates had a phylogenetic signal), we calculated Fritz and Purvis’s D statistic for binary traits41 using the phylo.D command within the R package caper v1.0.379 with 1000 permutations.

Genome assembly and annotation

Draft genome assemblies were available from NCBI for 152 isolates (9 clinical and 143 environmental)25,32,42,43,44,80, as of June 2021. For clinical isolates sequenced in this study, 16 clinical isolates with public data, and a further 12 environmental isolates from NCBI, genomes were assembled using trimmed reads (described above). Each de novo assembly was performed using SPAdes v3.15.081 with default parameters except for k-mer count (set to 21, 33, 55, 77, 99, and 127). For all assemblies, scaffolds were filtered using Funannotate v1.8.1069 to remove duplicate sequences and those under 500 bp in length. Scaffolds were masked for repeats using RepeatMasker82 within Funannotate v1.8.10. Mitochondrial sequences and any bacterial, primate, or viral contaminants identified through routine screening upon submission to NCBI were removed from the genome assemblies. All genomes were evaluated for completeness using the BUSCO v4.04 Eurotiales database of 4,191 single-copy genes83. Several NCBI datasets did not have adequate sequencing depth for genome assembly, had contaminating reads, or had BUSCO scores below 95% and were therefore not included in the pan-genome analyses; we also removed clones as calculated using genetic distance based on SNPs. As such, we included 243 genomes in the pan-genome.

Gene predictions were generated by Funannotate v1.8.10 using the built-in gene models of Aspergillus oryzae (section Flavi) as predicted by EVidenceModeler84, with additional evidence provided in the form of predicted amino acid sequences for proteins from the A. flavus NRRL 3357 annotation59. To validate the gene-prediction procedure, we compared the new annotation of NRRL 3357 to two recent in-depth annotations48,59 using OrthoVenn285. Additional functional annotations were obtained through the “annotate” option within Funannotate that uses InterProScan v5.61.9386 with default parameters. Twenty-one genes associated with plant pathogenicity in A. flavus were previously identified from the literature25 and located in each genome via BLASTp87. Global alignments were conducted using the Needleman-Wunsch algorithm implemented in EMBOSS v6.6.088. Predicted biosynthetic gene clusters (BGCs) were identified using the fungal version of antiSMASH v6.089, with default parameters, and collated into a table format. For specific clusters of interest, we used BLASTn to confirm the presence or absence of backbone genes as defined by antiSMASH (core biosynthetic genes), e.g., querying the pksA nucleotide sequence from the A. flavus reference strain NRRL 3357 against all genomes to confirm presence.

Pan-genome analysis

We identified orthologous proteins using OrthoFinder v2.5.490 in all A. flavus genomes that had ≥ 95% completeness of the 4,191 genes in the BUSCO Eurotiales gene set83, resulting in a dataset of 247 isolates. The core genome was defined as in Lofgren et al. 38 as the set of genes that were present in at least 95% of isolates (in our dataset 236 or more); all other genes were considered part of the accessory genome. A subset of the accessory genome, the “cloud” genome, includes orthogroups present in <5% of isolates. The presence/absence matrix of the accessory genome was visualized using the R package ComplexHeatmap v2.16.091. We created a gene accumulation curve and gene frequency histogram using the R packages vegan v2.6-1092,93, philentropy v0.9.094, and ggplot2 v3.5.195. Using vegan v2.6-1092, we also calculated a distance matrix for the presence or absence of orthogroups within the accessory genome using Jaccard distance. The distance matrix was then used as input for a principal coordinates analysis (weighted classical multidimensional scaling) to visualize population-level differences in accessory genome content. The accessory genome principal coordinates analysis was visualized using ggplot2 v3.5.195. Also using vegan v2.6-1092, the gene accumulation curve was calculated using the “random” method and 100 permutations. The alpha for Heaps’ law was calculated using the R package micropan v2.196. Orthogroups were considered population-specific when absent in all isolates of a particular population but present in >90% of isolates in other populations, consistent with definitions from Lofgren et al. 38.
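The core, accessory, and cloud definitions above, together with the Jaccard distance on accessory content, can be sketched in a few lines of Python. This is an illustration of the definitions rather than the analysis code used in the study: the function names, the dictionary input format, and the toy dataset of 40 isolates are hypothetical.

```python
from collections import Counter


def classify_orthogroups(isolate_to_orthogroups, core_fraction=0.95, cloud_fraction=0.05):
    """Label each orthogroup as 'core', 'accessory', or 'cloud'.

    core:      present in >= core_fraction of isolates
    cloud:     present in <  cloud_fraction of isolates (a subset of the accessory genome)
    accessory: everything else
    """
    n_isolates = len(isolate_to_orthogroups)
    counts = Counter(og for ogs in isolate_to_orthogroups.values() for og in ogs)
    classes = {}
    for og, count in counts.items():
        fraction = count / n_isolates
        if fraction >= core_fraction:
            classes[og] = "core"
        elif fraction < cloud_fraction:
            classes[og] = "cloud"
        else:
            classes[og] = "accessory"
    return classes


def jaccard_distance(a, b):
    """Jaccard distance between two sets of orthogroups (presence/absence)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy dataset (hypothetical): 40 isolates, three orthogroups.
toy = {f"iso{i}": {"OG_core"} for i in range(40)}
for i in range(20):
    toy[f"iso{i}"].add("OG_half")  # present in 50% of isolates
toy["iso0"].add("OG_rare")         # present in 2.5% of isolates

classes = classify_orthogroups(toy)
non_core = {og for og, label in classes.items() if label != "core"}
print(classes)
print(jaccard_distance(toy["iso0"] & non_core, toy["iso1"] & non_core))
```

A full pairwise matrix of these distances is what would then be passed to a principal coordinates analysis.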

Orthogroups were associated with locus tags and functional annotations using a custom Python v3.9 script. Analysis of functional annotation differences among the populations was performed by ANOVA in R v4.3.1 using the number of genes in each isolate’s genome that contained each annotation as input. Tukey’s HSD was used for post hoc testing of gene ontology term abundance with a significance cutoff of p < 0.05. Statistics and Bonferroni correction for multiple testing were performed using base R v4.3.1. Heatmaps were constructed using the R package ComplexHeatmap v2.16.0, and box plots were made using ggplot2 v3.4.4.
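To make the statistical procedure concrete, the sketch below computes a one-way ANOVA F statistic for per-isolate gene counts grouped by population and applies a Bonferroni adjustment to a list of p-values. It is a minimal Python illustration, not the R code used in the study: the function names and the example counts are hypothetical, and converting F to a p-value (which requires the F distribution, available in R or SciPy) is deliberately left out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one per population, holding the number of
    genes with a given annotation in each isolate. Assumes at least two
    groups and non-zero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical gene counts for one annotation in three populations.
counts_by_population = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(counts_by_population))

# Hypothetical raw p-values from three annotation-level tests.
print(bonferroni([0.001, 0.02, 0.5]))
```

With one test per annotation, the number of tests passed to `bonferroni` would equal the number of annotations examined.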

We used a phylogenetic generalized least squares (PGLS) analysis as conducted in the R package caper v1.0.379 to evaluate whether traits were more likely to be shared by closer relatives in accordance with a Brownian motion model of evolution. PGLS analyses incorporate the phylogenetic relationships between individual data points when examining linear regression of variables97. We fit a model of genome size against the number of predicted genes, number of predicted tRNAs, and the number of predicted BGCs using a maximum likelihood estimate of lambda. We also fit a model to explain source (clinical or environmental) using the 10 orthogroups with the most variation in number of genes included in the orthogroup.