Introduction

Parastagonospora nodorum is a necrotrophic fungal pathogen causing septoria nodorum blotch (SNB) of wheat (Triticum spp.)1 leading to significant yield losses2. P. nodorum is primarily spread by infected seed, infested debris or by wind-dispersed sexual ascospores. Secondary infections can occur when water-splash spreads asexual pycnidiospores to higher leaves and glumes, causing further necrotic patches and crop loss. P. nodorum is observed to be highly diverse in the field3,4, and appears to regularly reproduce sexually5,6,7. This suggests that P. nodorum populations have a high capacity for adaptation, with potential for selective pressures to be quickly overcome by extant diversity.

P. nodorum infection relies on necrotrophic effector proteins (NEs), which are secreted into the host and cause disease symptoms upon recognition by cognate host susceptibility (S)-receptors8. Five NEs have been characterised (SnToxA9, SnTox110, SnTox311, SnTox26712 and SnTox513) and additional NE interactions have been proposed14,15,16,17,18,19,20,21. An additional ceratoplatanin-like effector homolog that is broadly conserved across plant-pathogenic fungi (SnodProt1) has also been characterised in P. nodorum22,23. Currently identified effectors have led to the deployment of resistant wheat cultivars24. Quantitative trait loci (QTL) that are associated with disease-resistance indicate additional effectors, which if characterised can provide further crop improvement. However, epistatic interactions of SnTox1 and SnTox267 over SnTox316,25 indicate that combined interactions between multiple effectors may be complex and may vary under different conditions. Reliable markers for host S-genes and an improved understanding of NE epistatic interactions are important for ongoing disease-resistance breeding. These advancements in crop-protection rely on the prior discovery of NEs26 and upon accumulating genomic and bioinformatic resources27 which have enabled effector discovery across multiple pathogen species28.

P. nodorum was among the first fungal species for which a reference genome sequence was generated (Western Australian (WA) isolate Sn15)29, and the first species of the class Dothideomycetes that comprises several important cereal pathogens30,31. Since its initial genome analysis, the Sn15 isolate has become an important reference and model for cereal necrotrophs1, accumulating significant bioinformatic resources over time, including transcriptomic29,32,33,34,35, proteomic35,36, and metabolomic37,38,39,40 datasets. Chromosome-scale reference genome assemblies have been generated for four isolates: the Australian Sn15 isolate and 3 USA-derived isolates: LDN03-Sn4, Sn2000 and the avirulent/Agropyron-isolated Sn79-108734,41.

The study of effector content and of other genomic features that may contribute to the virulence of P. nodorum is ongoing. A ~ 400 kb accessory chromosome, typically designated chromosome 23 (or AC23) is absent from Sn79-108742 and is highly mutated34,41. Regions high in RIP-like mutations and AT-rich sequences were observed around repeat-rich stretches of AC23 and sub-telomeric regions of other chromosomes34,41,43. Candidate secreted effector-like proteins (CSEPs) have been predicted based on an ensemble of features including predicted secretion signals, sequence-based or structural homology to known effectors, positive selection, presence-absence variation (PAV), genomic location (including: G:C content, distance to telomeres, and proximity to transposable elements), genome-wide association34,41,43,44, and predictive models trained on the physicochemical properties of known fungal effectors45,46. For the Australian reference isolate Sn15, CSEP predictions have been combined with additional supporting experimental and bioinformatic indicators, including: in planta gene expression33, predicted lateral gene transfers with other cereal-pathogens (https://effectordb.com), and priority-ranking based on aggregation of multiple prediction types41,43.

Decreasing costs of genome sequencing over the last decade have progressively shifted focus from the study of solitary reference isolates to comparative genomics at increasingly larger scales. Three pangenomic comparative studies of P. nodorum have been conducted on regional scales, including isolates from Iran, Finland, Sweden, Switzerland, South Africa, the USA, and Australia41,43,44,47. Iran appeared to be the most genetically heterogeneous region, reflecting a longer history of host co-evolution during the early domestication of wheat in the fertile crescent48. Positive selection pressures and presence-absence variation (PAV) have been observed for effector loci and for accessory sequences with potential roles in virulence43. Pangenome-based surveys of fungicide-resistance adaptations have been performed across Australia, Iran, South Africa, Switzerland, and the USA47, indicating higher incidences of azole resistance in Switzerland. A pangenomic survey of isolates infecting Spring, Winter and Durum wheat across the USA44 identified 2 sub-populations corresponding to geographic regions and host wheat lines. Presence of effector loci was variable, with SnToxA, SnTox1 and SnTox3 being absent in 37%, 5% and 41% of US isolates respectively, and SnToxA being mostly absent in one sub-population. Collectively, these studies highlight the regional profiles of pathogenicity factors in P. nodorum and the emerging diagnostic potential of pangenomic surveys.

Genomic diversity of P. nodorum in Western Australia (WA) was initially surveyed using 28 simple sequence repeat (SSR) markers versus 55 WA isolates collected over a period of 44 years, and contrasted to 23 French and US isolates49. This prior study indicated two core admixed sub-population groups in WA, and at least three homogeneous groups that were restricted both geographically and temporally. Population shifts between these groups over time appeared to correlate with the historical preference for different wheat cultivars, and were prominent from 2013 when mass adoption of the SnToxA-insensitive “Mace” comprised up to 70% of areas sown50. Although overall disease-resistance among wheat cultivars may have increased over time, recently sampled isolates from emergent clusters were also reportedly more aggressive49. In this study, we have generated pangenome resources corresponding to this prior survey (Fig. 1, Supplementary Data 1). In corroboration with previous findings, we observed the WA P. nodorum population was separated into a core population and a handful of small, homogeneous sub-population groups. We generated a panel of orthologous genes that represent the observed gene content across the P. nodorum pangenome, and have used this panel to predict effector candidates, and note the subtle influence of repeat-induced point mutations (RIP) upon the evolution of this model cereal necrotroph. As effectors are the key determinants in necrotrophic interactions with host sensitivity loci8, we mined the P. nodorum pangenome for protein isoforms of known effectors, and report on isoform diversity between isolates sampled across Western Australian wheat-growing regions49 and a representative panel of international isolates43.

Fig. 1

Locations across Western Australia (dark grey) where Parastagonospora nodorum isolates were sourced for whole-genome sequencing.

Results

Phylogeny and structure of the Western Australian P. nodorum population

Mean genome size across the pangenome was 37.8 Mb per isolate (Supplementary Data 2) with an average of 18,392 annotations per isolate (Supplementary Data 3). There was an average of 6% repetitive DNA, comprised of 3% LTR retrotransposons, 2% DNA transposons, and 1% MITEs (Supplementary Data 4). There were 1,340,429 SNP variant sites detected across the pangenome relative to the Sn15 reference (Fig. 2), with RIP-like C:G↔T:A mutations comprising 78% of SNPs (Fig. 3). However, SnpEff analysis vs Sn15 annotations indicated that only 136,860 RIP-like (10.2%) and 78,401 (5.8%) non-RIP-like SNPs caused non-synonymous amino-acid changes (Fig. 3). For effector loci present in Sn15 (SnToxA, SnTox1, SnTox3 and SnTox267), 33% of RIP-like SNPs corresponded to non-synonymous changes and 73% to synonymous changes (Supplementary Table 1). Filtering of sequence variants relative to the Sn15 reference isolate produced 6787 bi-allelic, conserved SNPs occurring in ≥95% of isolates. A phylogenetic tree and sub-population groups predicted using this data indicated 6 groups, with Iranian isolates strongly associated with group 3, and US and European isolates assigned to groups 3 and 4 (Figs. 2 and 4, Supplementary Fig. 1). The majority of WA isolates were assigned to group 4 representing the core WA population (equivalent to groups 1 and 2 from a previous SSR-based study49). However, a handful of phylogenetically-similar and regionally-proximal clades corresponded to other groups (1, 2, 5 and 6), which were also indicated in the previous study. Isolates assigned to these groups were typically collected from, but not exclusively representative of, the northern Geraldton region (Fig. 4). Interestingly, the Sn15 reference isolate was assigned to group 2 and is not a typical representative of the core WA population (group 4).

Fig. 2: Summary of mutation across the Parastagonospora nodorum pangenome, relative to the Sn15 reference isolate.

Rings proceeding inwards represent: Sn15 chromosomes from 1 to 23 (black/grey, accessory chromosome = red), with labels indicating the locations of 4 effector loci present in Sn15; G:C content (grey); gene density (green); Predector (effector-likelihood) score (green/red); repetitive DNA density (red); composite RIP index (CRI) (green/red); SNP site density (blue = total, yellow = (RIP-like) transition mutations, red = non-synonymous RIP-like transitions); presence-absence variation (PAV) relative to all isolates (red). The geographic region, predicted phylogeny, and population grouping of isolates are indicated alongside corresponding PAV tracks.

Fig. 3: Summary of SNP mutation sites (left) detected across the Parastagonospora nodorum pangenome relative to the Sn15 reference isolate.

SNP mutation sites were categorised (middle) into RIP-like (C↔T or A↔G SNPs) and Other/non-RIP-like (not C↔T or A↔G SNPs) and by their predicted effects on protein-coding genes (right).

Fig. 4: Structure and pathogenicity features of the Western Australian (WA) Parastagonospora nodorum population.

SNP-derived phylogeny (left) of Western Australian and internationally-sampled P. nodorum isolates (see legend), shows: isolates (branch labels); sampling year; sub-population groups (green) from this study (right) and a previous study (left)49; mating-type loci (blue); and the presence of effector loci (red-middle) and effector protein isoforms (red-right, from left-to-right: highest to lowest frequency). An alternate version with overlaid branch lengths is presented in Supplementary Fig. 1.

Effector protein isoform profiles were consistent with phylogeny

The presence of known necrotrophic effector (NE) loci SnToxA (represented by Parastagonospora nodorum ortholog group (SNOO) SNOO_16571A), SnTox1 (SNOO_20078A), SnTox3 (SNOO_08981A), SnTox267 (SNOO_14493A) and SnTox5 (SNOO_50320) was ubiquitous across WA, with the majority of isolates possessing all 5 NE loci (Fig. 4). Infrequently, SnToxA, SnTox1, SnTox3 and SnTox5 loci were absent, although this presence-absence variation was more common among international isolates and rare among WA isolates. Notably, SnTox5 was consistently absent from sub-population group 2 which included the Sn15 reference isolate, yet absence of SnTox5 was not observed among international isolates. At the protein isoform level, NE profiles of WA isolates were distinct from international isolates. Across WA, dominant isoforms and less frequent secondary isoforms were observed, and additional isoforms were rare. NE isoform profiles also tended to conform to the predicted phylogenetic structure (Fig. 4).

Leveraging comparative pangenomics and function for prediction of effector candidates

There were 34,381 clusters of orthologs predicted across the P. nodorum pangenome, with 14,050 (40.9%) core groups present in all isolates, 11,470 (33.3%) variable (accessory) groups and 8861 (25.8%) singleton groups (Supplementary Figs. 2, 3, Supplementary Data 5). Rarefaction analysis of ortholog group presence across all isolates indicated this dataset represents a ‘closed’ pangenome51 (Supplementary Fig. 4). After functional annotation, there were 19,465 groups (56.6%) remaining with no informative matches. Based on dN/dS branch site tests, 5294 groups (15.4%) were under positive selection. Accessory orthogroups tended to be closer to repeat and telomere regions, with lower dN/dS and higher FYKIN:GAP ratios52 that would indicate relative increase in diversifying selection driven by RIP mutations (Supplementary Fig. 3, Supplementary Data 6). Singleton orthogroups tended to have slightly higher Predector scores that may indicate effector-like properties. Accessory orthogroups appeared to be enriched in several functions including cell death, membrane transport, regulation of transcription and DNA replication. Singleton orthogroups were also enriched in functional annotations related to protein repeats, protein-protein interactions, ubiquitination and viral replication (Supplementary Data 6).
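The core, accessory and singleton categories reported above can be illustrated with a minimal sketch. It assumes each ortholog group is stored as the set of isolates carrying it, and that a singleton is a group found in exactly one isolate; the function names and this data layout are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def classify_orthogroup(present_in, all_isolates):
    """Label one ortholog group by the number of isolates that carry it."""
    isolates = set(all_isolates)
    carriers = set(present_in) & isolates
    if not carriers:
        raise ValueError("ortholog group is absent from every isolate")
    if len(carriers) == len(isolates):
        return "core"        # present in all isolates
    if len(carriers) == 1:
        return "singleton"   # assumption: found in exactly one isolate
    return "accessory"       # variable presence across isolates


def summarise(groups, all_isolates):
    """Return {category: (count, percent of all groups)}."""
    counts = Counter(classify_orthogroup(m, all_isolates) for m in groups.values())
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}
```

Applied to a full presence-absence table, `summarise` would give counts and percentages in the same form as those quoted in the text.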

Prediction of candidate secreted effector proteins (CSEPs, see methods) resulted in 186 orthogroups, of which 69 (37.1%) had functional information, and 17 (9.1%) were under positive selection (Supplementary Data 7). The 69 functionally-annotated CSEPs included ortholog groups corresponding to 6 known P. nodorum NE loci SnToxA, SnTox1, SnTox3, SnodProt122,23, SnTox5 and SnTox267 at predicted ranks 2, 3, 13, 27, 42 and 43 respectively (Supplementary Data 8). Other groups were homologous to several effector loci identified in other plant-pathogen species, including MoCDIP453, MoAAT54, FgXYLA55,56, CfTom157,58, MoSPD5/MoBas459,60, and Mycgr3G3810561 (Table 1, Supplementary Data 8).

Table 1 Summary of 69 candidate secreted effector protein (CSEP) ortholog clusters of the P. nodorum pangenome, including 6 confirmed effectors - ranked by Predector score, filtered for: predicted secretion, Predector score ≥2, ≥2 cysteines, excluding singletons, and including functional information

Discussion

Previously the structure of the WA P. nodorum population was assessed with SSR markers from which 5 sub-population groups were predicted49. Two of these groups were proposed to represent a gradual change over time in the core population in response to wheat cultivar use, while the remaining homogeneous clusters may be clonally-expanded populations. In this pangenome-based study 6 sub-population groups were predicted, with a core WA group (group 4) and geographically-restricted clonal groups. The ratio of mating-type loci in the core population was close to 1:1 indicating heterothallic meiotic potential, in line with previous reports from WA5,6 and elsewhere62. In contrast, the clonal sub-groups only had a single mating type and were thus asexual (Fig. 4), with 1 exception. A single WA isolate (group 6) and a single Iranian isolate both appeared to match both mating-type loci (Fig. 4). This may potentially indicate contamination of those samples where more than one isolate has been sequenced, or alternatively this can indicate a spontaneous shift to homothallism which may occur rarely. The clonal sub-groups exhibited a similar proportion of RIP-like SNP mutations relative to the core population (~80%) with the exception of group 2 (96%), which notably contains the reference isolate Sn15 and consistently lacked the Tox5 locus (Fig. 4, Supplementary Data 9). Clonal sub-groups were also primarily collected from the northern Geraldton region, which is relatively hotter with less rainfall63. High temperatures have been negatively correlated with P. nodorum disease load64. Conversely rainfall and splash dispersal have been associated with higher disease loads1,64, and rain impacts may also promote airborne dispersal over longer distances65. The combination of these climatic factors may have contributed to the homogeneity across this region. Furthermore, the phylogeny of the core-group did not indicate strong association with geographic regions. 
Long-range wind dispersal of sexual ascospores has been reported in WA5 and dispersal by infected seed is also a possibility66,67,68. Speculatively, the population structure of P. nodorum may be less dependent on geographic distance, when compared to influences of climatic and anthropic factors.

Presence of necrotrophic effector loci that correspond to cognate host sensitivity receptor loci is a useful predictor of the outcome of P. nodorum infection26. Previously, discrete sets of effector candidates were predicted for two US sub-populations44, highlighting the importance of region-specific analysis. In this study focused on the Western Australian wheat belt region, conserved effector isoform profiles for effector loci SnToxA, SnTox1, SnTox3, SnTox267 and SnTox5 generally conformed to phylogenetic structure (Fig. 4). Despite the extreme genome plasticity of fungal genomes52,69,70,71 and the unsurprisingly high levels of RIP-like mutations observed across the P. nodorum pangenome (Figs. 2 and 3), relatively little effector protein isoform diversity was observed across WA (Fig. 4). Effector loci of Sn15 are located at or near telomeres, which are hotspots for TEs, SNPs, intrachromosomal recombinations, duplications, and positive selection34,42,69,70,72. Yet, only a strongly dominant isoform and an infrequent secondary isoform were observed, and if present additional isoforms were extremely rare. RIP appears to be a strong driving force causing many DNA-level mutations across the entire landscape of the P. nodorum pan-genome, presumably due to “RIP-leakage”73 which is frequently observed in the Pezizomycotina52 and extends up to (at least) 4-5 Kb from a RIP-targeted repeat74,75. Although the majority of protein-coding genes of P. nodorum are within 2-3 Kb of their nearest repeat (Supplementary Fig. 3), RIP-leakage in P. nodorum is balanced by strong selection against mutations causing amino acid changes, even for necrotrophic effector loci. The ratio of non-RIP-like to RIP-like non-synonymous mutations for all loci (78,401:136,860 = 0.57) was the same as that observed for known effectors (9:18 = 0.57, Supplementary Table 1). 
The pathogenic fitness of biotrophs and hemibiotrophs52,76 can benefit from RIP-driven pseudogenisation of effector or other PAMP-producing loci, however this does not typically apply to a necrotroph like P. nodorum. Consequently, these observations suggest that most RIP mutations altering protein-coding gene regions are strongly selected against, to avoid deleterious losses of function.

Pathogen pangenomics has the potential to enable affordable genome-based crop disease surveillance tailored to local regions41,44,77. This study focusses on a population of the wheat pathogen Parastagonospora nodorum from the Western Australian wheat belt region. The collective bioinformatic resources for P. nodorum have significantly improved over time, including the development of approaches to pangenomic analysis at regional scales. By aggregating multiple predictive methods and data sources, a stringent set of 69 candidate effectors has been generated that may guide experimental validation and discovery of effectors. Alternate reproductive modes were also observed in some regions, highlighting the potential need for differential disease management under altered population growth conditions. At the genome-level we observed high potential for adaptability, indicated by widespread RIP-like mutations that appeared to drive heterogeneity at the DNA level. Counter-intuitively, there were relatively few mutations retained at the protein isoform level, even within necrotrophic effector loci typically associated with mutation hotspots. In P. nodorum and potentially other necrotrophs, the majority of RIP-driven heterogeneity may be purged by strong selection against non-synonymous mutations, resulting in relative homogeneity across its ‘pan-proteome’. Regardless, there is encouraging potential to extend pangenome-based insights and the effector isoform profiling approaches described here to future plant pathology applications. The reduction of total gene content and SNP-level diversity down to simplified isoform profiles could be used as an alternative to traditional and haplotype-based pathotyping48,78, and GWAS approaches testing for SNPs associated with cultivar susceptibility12,13,44. 
In this manner, despite the vast potential for DNA mutation observed for most fungal pathogen genomes, future effector studies that use isoform profiling may be less prone to RIP-related errors.

Materials and methods

Whole genome sequencing of Western Australian P. nodorum isolates

Genomic DNA of 141 P. nodorum isolates49 sampled across the Western Australian wheat-belt region (Fig. 1, Supplementary Data 1) was extracted79 and sequenced by the Australian Genome Research Facility (Melbourne, Australia) (Illumina HiSeq2500, TruSeq PCR-free, 125 bp paired end (PE), 600 bp insert size) [NCBI BioProject: PRJNA612761]. Genomic DNA of 17 new isolates and 2 repeated isolates (14FG141 and Mur_S3 from the previous 141) was extracted with the Qiagen DNeasy Plant Mini kit (Venlo, Netherlands; Catalogue ID: 69104) and sequenced by Novogene (Beijing, China) (Illumina HiSeq2500, TruSeq PCR-free, 150 bp PE, 350 bp insert size). Data from prior studies was also used, including draft genomes of 15 international P. nodorum isolates43 [NCBI BioProject: PRJNA476481]; and chromosome-scale genome assemblies for Western Australian reference isolate Sn1541 [NCBI Assembly: GCA_016801405.1], and US isolates LDN03-Sn4 [NCBI Assembly: GCA_002267005.1], Sn2000 [NCBI Assembly: GCA_002267045.1] and Sn79-1087 [NCBI Assembly: GCA_002267025.1] [NCBI BioProject: PRJNA398070]34.

Reads were trimmed with CutAdapt v1.18 (2 passes, 3 trims/pass, terminal Phred score >2, average Phred score ≥5, length ≥50)80 and BBduk v38.38 (read kmer coverage 0.7)81 versus UniVec (https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) and PhiX (NCBI RefSeq: NC_001422.1)82. Sample contamination was checked with Kraken v2.0.783 versus NCBI Refseq (bacteria, archaea, protozoa, virus, and fungi: downloaded: 2019-03-16), and human GRCh3884, as well as to 4 reference P. nodorum genomes as a positive set34,41. Insert size and completeness were assessed by alignment to Sn15, LDN03-Sn4, Sn2000 and Sn79-1087 genomes with BBmap v38.3881 and quality control statistics were assessed with FastQC v0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), BBmap, Samtools85, and MultiQC86 within the qcflow pipeline87.

Variant calling relative to the P. nodorum Sn15 reference isolate

Reads were aligned to the Sn15 ref. 41 with bwa mem 0.7.17-r119888 and outputs were converted to aligned BAM format with GATK 4.2.6.1 (MarkIlluminaAdapters, MarkDuplicates, MergeBamAlignment -CREATE_INDEX -ADD_MATE_CIGAR)89. Sequence variants relative to Sn15 were generated in gVCF format with GATK HaplotypeCaller (-ERC GVCF --minimum-mapping-quality 20 --min-base-quality-score 20 -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation) and isolates were genotyped with GATK CombineGVCFs and GenotypeGVCFs, filtering variants with GATK VariantFiltration for SNPs (QD < 2, QUAL < 30, SOR > 3, FS > 60, MQ < 40, MQRankSum < −12.5, ReadPosRankSum < −8) and InDels (QD < 2, QUAL < 30, FS > 200, ReadPosRankSum < −20). Filtered variants resulting in non-synonymous or nonsense mutations relative to Sn15 gene annotations were identified with SnpEff90.
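The hard-filter thresholds listed above can be expressed as a small rule table. The following is a minimal sketch rather than the GATK implementation: it assumes each variant's annotations have already been parsed into a plain dictionary, and it skips annotations that are missing from a record (a simplifying assumption). The names `SNP_RULES`, `INDEL_RULES` and `failed_filters` are hypothetical.

```python
import operator

# (annotation, comparison, threshold): a variant fails when the comparison is true.
SNP_RULES = [
    ("QD", operator.lt, 2.0),
    ("QUAL", operator.lt, 30.0),
    ("SOR", operator.gt, 3.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_RULES = [
    ("QD", operator.lt, 2.0),
    ("QUAL", operator.lt, 30.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
]


def failed_filters(annotations, rules):
    """Return the names of the rules that a variant fails, in rule order.

    Annotations absent from the record are skipped (assumption of this sketch).
    """
    return [
        name
        for name, compare, threshold in rules
        if name in annotations and compare(annotations[name], threshold)
    ]
```

A variant passes when `failed_filters` returns an empty list; the same function serves SNPs and InDels by swapping the rule table.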

To predict population structure groups, VCFs were converted to PLINK bed format using PLINK v1.90b7 and used as input to fastSTRUCTURE v1.091. Initially, for K = 1–12, twelve independent runs were performed with default parameters and the function “chooseK.py” was used to select the optimal run. To predict phylogeny, ≤2 bi-allelic and conserved (≤5% missing data) SNPs were randomly selected within 5 kbp increments with BCFtools92 (view --max-alleles 2 -e ‘F_MISSING <= 0.05’; +prune -l 0.9 -w 5000bp -n1 -N rand) and used to predict a phylogenetic tree with IQTree v2.0.3 (-bb 1000 -alrt 1000)93. The SNP-derived phylogenetic tree was visualised with iTOL v594 alongside geographic location, mating-type genes, fastSTRUCTURE-based population groups, previously published SSR-marker-derived population groups49, and pathogenicity effector profiles.
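The random selection of SNPs within fixed genomic increments can be sketched as follows. This is an illustration only: it uses fixed, non-overlapping bins and a seeded random choice, which is a simplification and not a re-implementation of the BCFtools prune plugin. The function name `thin_snps` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def thin_snps(sites, window=5000, per_window=1, seed=0):
    """Randomly keep at most `per_window` SNPs in each fixed-size window.

    `sites` is an iterable of (chromosome, 1-based position) tuples.
    Windows are non-overlapping bins: 1..window, window+1..2*window, and so on.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for chrom, pos in sites:
        bins[(chrom, (pos - 1) // window)].append((chrom, pos))
    kept = []
    for key in sorted(bins):
        members = bins[key]
        kept.extend(rng.sample(members, min(per_window, len(members))))
    return sorted(kept)
```

Thinning of this kind reduces the influence of tightly linked sites on the tree; the seed makes a given selection reproducible.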

Previous studies have established that repeat-induced point mutation (RIP) in P. nodorum41, and broadly across many other fungal species52, has a strong bias for mutation of CpA to TpA dinucleotides. Therefore, bi-allelic SNP variants which were comprised of either “C” and “T” allele pairs, or the reverse complement “A” and “G”, were designated “RIP-like” for subsequent analysis. SNP variants relative to Sn15 were also used to calculate Composite RIP Index (CRI)95.
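The "RIP-like" designation described above amounts to a simple test on allele pairs. A minimal sketch, assuming variants are available as a reference allele plus a list of alternate alleles (the function names are hypothetical):

```python
# Allele pairs treated as RIP-like: C/T, or the reverse complement A/G.
RIP_LIKE_PAIRS = {frozenset("CT"), frozenset("AG")}


def is_rip_like(ref, alts):
    """True for a bi-allelic SNP whose two alleles are C and T, or A and G."""
    if len(alts) != 1:
        return False                      # bi-allelic sites only
    alt = alts[0]
    if len(ref) != 1 or len(alt) != 1:
        return False                      # single-nucleotide changes only
    return frozenset((ref.upper(), alt.upper())) in RIP_LIKE_PAIRS


def rip_like_fraction(variants):
    """Fraction of (ref, alts) records that are RIP-like; 0.0 for no input."""
    variants = list(variants)
    if not variants:
        return 0.0
    return sum(is_rip_like(ref, alts) for ref, alts in variants) / len(variants)
```

Because the test ignores which allele is the reference, it captures transitions in either direction, so it labels a mutation as RIP-like rather than confirming RIP as its cause.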

De novo genome assembly of Western Australian P. nodorum isolates

Overlapping read pairs were merged with BBmerge v38.3896 (strict = t k = 62 rem = 50 ecctadpole = t) and combined with unmerged pairs for de novo genome assembly with Spades v3.13.097 (--careful --cov-cutoff auto). Mitochondrial genomes (mtDNAs) were assembled with Novoplasty v2.7.298, seeded with the Sn15 mtDNA [NCBI RefSeq: EU053989.1]29 (k = 31-81, selected for min. contigs with assembly size=47-52 Kb) (Supplementary Data 2) (via mitoflow v.1099). Nuclear assemblies were filtered for mtDNA with minimap2 (git commit 371bc95)100 (≥95% coverage, median depth ≥ 99.2% total depth). Assembly quality was assessed with Quast v5.0.2101, bbtools v38.3881, and KAT v2.4.2102 (via postasm v1.0103). Genome assemblies were aligned to Sn15 [NCBI Assembly: GCA_016801405.1]41 with nucmer v4.0.0beta2 (--maxmatch)34,104. Mean coverage within non-overlapping 50 Kb windows was calculated with BEDTools v2.28.0105 and visualised with circlize106.

Annotation of DNA repeats and non-protein coding gene features

DNA repeats were predicted using a combination of tools (Supplementary Data 4): EAHelitron (git commit c4c3dca)107, LTRharvest108, LTRdigest (genometools v1.5.10)109, MiteFinder (git commit 833754b)110, RepeatModeler v1.0.11111, and RepeatMasker v4.0.9p2112 (-species “Parastagonospora nodorum”). Putative transposable element (TE) protein-coding regions were predicted with MMSeqs2 v9-d36de113 versus selected Pfam families, GyDB families114, and a custom MSA database sourced from TransposonPSI (http://transposonpsi.sourceforge.net/) and LTR_retriever115 (via PanTE v1.0116).

Predicted TE sequences from EAHelitron, MiteFinder, RepeatModeler, and MMSeqs protein finding were clustered with VSEARCH v2.14.1117 (--cluster_fast combined.fasta --id 0.90 --weak_id 0.7 --iddef 0 --qmask dust), filtered for ≥4 copies in ≥20% of isolates, aligned with DECIPHER v2.10.0118, classified into subtypes with RepeatModeler (RepeatClassifier), and mapped to each isolate assembly with RepeatMasker. Non-coding rRNA and tRNA features were predicted with RNAmmer v1.2119 and tRNAscan-SE v 2.0.3120. Genome assemblies were soft-masked with TE and non-coding RNA features with BEDTools105.

Annotation of protein-coding genes

Data supporting gene annotation in the P. nodorum pangenome was derived from multiple sources. Previous annotations for Sn1541, LDN03-Sn4, and Sn7934 were mapped to all assemblies with Spaln v2.3.3 (-KP -LS -M3 -O0 -Q7 -ya1 -yX -yL20 -XG20000)121. Fungal proteins from UniRef50 (release 2019_08, downloaded: 2019-10-29, taxonomy = “Fungi [4751]” AND identity = 0.5) were aligned with Exonerate v2.4.0 (--querytype protein --targettype dna --model protein2genome --refine region --percent 70 --score 100 --geneseed 250 --bestn 2 --minintron 5 --maxintron 15000 --showtargetgff yes --showalignment no --showvulgar no)122 with pre-filtering using MMSeqs2 (-e 0.00001 --min-length 10 --comp-bias-corr 1 --split-mode 1 --max-seqs 50 --mask 0 --orf-start-mode 1). RNAseq reads for Sn15 in vitro and 3 days post infection on wheat leaves33 [GEO: GSE150493; SRA: SRX8337774-SRX8337777, SRX8337782-SRX8337785] were de novo assembled into transcripts using Trinity v2.8.4 (--jaccard_clip --SS_lib_type FR)123. RNAseq reads were also aligned to all assemblies with STAR v2.7.0e124 and assembled into transcripts with StringTie v1.3.6 (--fr -m 150)125. Assembled transcripts were aligned to genomes using Spaln v2.3.3 (-LS -O0 -Q7 -S3 -yX -ya1 -Tphaenodo -yS -XG 20000 -yL20)121, and GMAP v2019-05-12126.

Protein-coding gene annotations (Supplementary Data 3) were predicted in several stages. Initial predictions for each isolate used multiple tools: PASA2 v2.3.3 (-T --MAX_INTRON_LENGTH 15000 --ALIGNERS blat --transcribed_is_aligned_orient --TRANSDECODER --stringent_alignment_overlap 30.0)127, GeneMark-ET (--soft_mask 100 --fungus)128, CodingQuarry v2.0 (standard and “pathogen mode”), Augustus (git commit 8b1b14a, independently for forward and reverse strands; --hintsFile = hints.gff3 --strand = $ --allow_hinted_splicesites=’gtag,gcag,atac,ctac’ --softmasking = on --alternatives-from-evidence = true --min_intron_len = 5)129, and GeMoMa v1.6.1 (Sn15 annotations only)130. PASA2 predictions used GMAP- and BLAT-aligned RNASeq data. Augustus predictions used GMAP alignments, STAR intron features, and Spaln protein alignments as hints. PASA2, Augustus, and CodingQuarry predictions were clustered with MMSeqs2 (90% identity, 98% reciprocal coverage) and transferred with GeMoMa. Outputs from GeneMark-ET, CodingQuarry, Augustus, PASA, GeMoMa, Exonerate, Spaln protein and transcript alignments, and GMAP alignments were combined using EVidenceModeler (git commit 73350ce) (--min_intron_len 5)131. Augustus (all hints, parameters as above) was used to predict additional genes not overlapping with EVidenceModeler outputs.

Multiple steps were then taken to ensure accuracy and reliability of annotations across the pangenome. Pseudogenes were screened with AntiFam132 using HMMER v3.2.1 (--cut_ga). Annotations were considered “low confidence” if supported only by Spaln or GMAP transcript alignments, Exonerate protein alignments, or transfers of annotations between isolates performed via GeMoMa (unless derived from previously curated Sn15 annotations), or for Sn15, if supported only by the above, or Augustus. “Low-confidence” annotations overlapping annotations on either strand by more than 30% of their length were discarded. Frame/phase-shift annotation errors in outputs were corrected by mapping to all annotations of all isolates without internal stop codons, and all Pezizomycotina proteins from UniRef-90 (2020-05-13; taxonomy: “Pezizomycotina [147538]”; identity: 0.9) with blastx v2.10.0 (-strand plus -max_intron_length 300 -evalue 1e−5)133. In-phase matches lacking internal stops were retained, out-of-phase matches with stops were marked as pseudogenes, and annotations with internal stops and no matches were discarded. Annotations overlapping predicted rRNA genes (≥50% length) were discarded. Annotations with exons spanning assembly gaps were split into separate annotations if ≥60 bp. Annotation completeness was evaluated with BUSCO v3 (pezizomycotina_odb9)134 with additional statistics collected with genometools v1.5.10135. For Sn15, updated annotations were compared to previous versions41 with ParsEval/AEGeAn v0.15.0136 and BEDTools (bedtools subtract -a new -b old -s -A -F 0.2).

Orthology & positive selection

Orthology relationships were predicted with Proteinortho v6.0.30 (-singles -selfblast)137 and Diamond v2.0.8138, with alternate transcript isoforms (Sn15 only) allowed to cluster into separate ortholog groups. The prefix ‘SNOO_’ was assigned to clusters, with a numerical suffix based on: the “SNOG” locus numbers of corresponding Sn15 annotations41, or sequential numbers starting from 50,000 if not present in Sn15. Alphabetical suffixes also indicate Sn15 isoforms present in the cluster. Representative sequences were selected from each cluster, in descending order of priority: 1) Sn15 sequence with closest to average length, 2) presence in LDN03-Sn4, 3) Sn2000, 4) Sn79, 5) random selection from closest to average length (Supplementary Data 5). CDS sequences of clusters were codon-aligned with DECIPHER v2.16.1118, gene trees were estimated with FastTree v2.1.11139, and clusters were tested for positive selection with HYPHY v2.5.15140,141 (BUSTED method, p-value ≤ 0.01).
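The priority order for choosing a representative sequence can be sketched as below. This is an interpretation under stated assumptions: cluster members are given as (isolate, sequence ID, length) tuples, the closest-to-average-length rule is applied within whichever priority tier is used, and ties are broken by a seeded random choice. The names `REFERENCE_PRIORITY` and `pick_representative` are hypothetical.

```python
import random
from statistics import mean

# Reference isolates in descending priority, following the order in the text.
REFERENCE_PRIORITY = ["Sn15", "LDN03-Sn4", "Sn2000", "Sn79-1087"]


def pick_representative(members, seed=0):
    """Pick one (isolate, seq_id, length) tuple to represent a cluster."""
    if not members:
        raise ValueError("cluster has no members")
    average = mean(length for _, _, length in members)
    for isolate in REFERENCE_PRIORITY:
        pool = [m for m in members if m[0] == isolate]
        if pool:
            break
    else:
        pool = list(members)  # no reference isolate present: consider all members
    best = min(abs(m[2] - average) for m in pool)
    closest = [m for m in pool if abs(m[2] - average) == best]
    return random.Random(seed).choice(closest)
```

Preferring reference isolates keeps representative sequences tied to the most curated annotations wherever they exist.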

Functional analysis & effector candidate prediction

MetaEuk (v4, easypredict)142 was used to search for SnToxA (SNOO_16571A), SnTox1 (SNOO_20078A), SnTox3 (SNOO_08981A), SnTox267 (SNOO_14493A) and SnTox5 (SNOO_50320), and effector protein isoform profiles for each isolate were extracted from matching regions. Functional annotation was performed versus the representative ortholog clusters with InterProScan143,144 with additional GO-terms added with PANNZER145 and eggNOG-Mapper146 (excluding “anti-slim”). Additional annotations, properties, and effector-likelihood were added with Predector v.0.1.0147 (Supplementary Data 7). Candidate secreted effector-like proteins (CSEPs) were predicted by filtering ortholog clusters for the criteria: predicted secretion, ≥2 cysteine residues, Predector score ≥2, present in ≥1 reference isolate, excluding singletons, and ≥1 functional annotation (Supplementary Data 8).
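The CSEP filtering criteria listed above can be written as a single predicate. The sketch below assumes each ortholog cluster has been summarised into a small record; the `Orthogroup` class and its field names are hypothetical and do not reflect Predector's output format. The `require_annotation` switch is an assumption, added so that the functional-annotation criterion can be applied as a separate final step.

```python
from dataclasses import dataclass


@dataclass
class Orthogroup:
    """Hypothetical per-cluster summary used only for this illustration."""
    name: str
    secreted: bool
    cysteines: int
    predector_score: float
    n_reference_isolates: int
    is_singleton: bool
    n_functional_annotations: int


def is_csep(group, require_annotation=True):
    """Apply the CSEP criteria listed in the text to one ortholog cluster."""
    passes = (
        group.secreted
        and group.cysteines >= 2
        and group.predector_score >= 2
        and group.n_reference_isolates >= 1
        and not group.is_singleton
    )
    if require_annotation:
        passes = passes and group.n_functional_annotations >= 1
    return passes
```

Keeping the criteria in one function makes it straightforward to see how relaxing a single threshold changes the size of the candidate set.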

Statistics and reproducibility

Functional enrichment tests were performed using the total set of ortholog groups (n = 34,381). Fisher’s exact test (two-tailed) was applied to orthogroup counts assigned to individual functional annotations, comparing orthogroups belonging to sub-categories ‘accessory’ (n = 11,470) and ‘singleton’ (n = 8861) with those of the ‘core’ sub-category (n = 14,050). Functional annotations with p ≤ 0.05 are reported in Supplementary Data 6.
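For illustration, a two-tailed Fisher's exact test on a 2×2 table can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library and sums the probabilities of all tables that are no more likely than the observed one, which is one common definition of the two-tailed p-value; the software used in the study is not stated here, so this is not presented as the authors' implementation. The function names are hypothetical.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed p-value for the 2x2 table [[a, b], [c, d]].

    Example layout: a = accessory groups with an annotation, b = accessory
    groups without it, c = core groups with it, d = core groups without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_p(x):
        # Hypergeometric log-probability of a table with x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_p(a)
    p_value = sum(
        exp(log_p(x))
        for x in range(lowest, highest + 1)
        if log_p(x) <= log_observed + 1e-7  # small tolerance for rounding error
    )
    return min(1.0, p_value)
```

Working in log space avoids overflow when the group counts reach the tens of thousands, as they do for the core and accessory categories.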

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.