Introduction

Rice (Oryza sativa and O. glaberrima) is grown in over 100 countries as a staple food for an estimated 3.5 billion people1. The wild Oryza species represent largely untapped reservoirs of novel genetic material that can be used to improve agronomically important traits in cultivated rice2,3. Four major wild relatives of rice occur in Australia: Oryza australiensis (EE genome), O. meridionalis (AA genome), O. rufipogon (AA genome), and O. officinalis (CC genome)4. Of these, the diploid O. australiensis (2n = 24) is the most distantly related, having diverged from the progenitor of cultivated rice more than 7 million years ago5. This species possesses a number of unique phenotypic characteristics that make it a potential candidate for rice improvement4. Genotypic and morphological tolerance traits related to drought6, heat7, and salinity8 have enabled the species to survive perennially within the coastal and inland regions of northern Australia. Oryza australiensis also possesses resistance to a number of globally important pests and diseases of rice, including the brown planthopper9, bacterial blight (BB)10, and rice blast disease11.

The assembly and annotation of wild Oryza reference genomes have revealed insights into the evolutionary history of the genus and enabled the discovery of novel gene targets for rice breeding programs2,12. A number of gene variants identified in wild rices have been found to confer advantageous traits, such as variations in the BB resistance gene Xa27 from O. minuta13 and in the heat tolerance gene HTH5 from O. rufipogon14. The O. australiensis genome (~ 900 Mb) is more than twice the size of the cultivated rice genome due to a rapid expansion of LTR retrotransposon families, and this impeded the generation of a reference genome assembly until recently12,15,16,17. As genome sequencing and assembly technologies continue to advance, an updated, high-quality de novo reference for the O. australiensis genome will yield further genetic insights from this species. Here, we generate a chromosome-scale, haplotype-resolved de novo assembly for O. australiensis using PacBio long reads and Hi-C sequencing data. We aimed to explore the potential of O. australiensis as a candidate source of novel genetic material by identifying species-specific genes and protein clusters that are absent in cultivated rice. We also investigated the structural variations across haplotypes, accessions, and species to understand intra- and inter-species diversity. We identified genomic regions comprising potential disease resistance gene loci and located homologues of cloned rice resistance genes, highlighting O. australiensis as a source of novel disease resistance.

Results

Generation of a de novo, chromosome-level O. australiensis genome assembly

The two PacBio Sequel II SMRT cells used for long-read sequencing of Oryza australiensis produced HiFi yields of 36.3 Gb (40X) and 36.5 Gb (40X) with a Q35 median read quality and a total combined genome coverage of 80X. Assembly of the HiFi reads was conducted with the Hifiasm assembler18 using paired-end Hi-C read integration to produce a primary contig assembly. The primary assembly yielded a total of 1452 contigs, with an N50 of 72.6 Mb and a total length of 975 Mb. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis with universal single copy genes within the viridiplantae lineage identified a genome coverage of 99.3%.
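The contiguity and depth figures quoted above follow from two simple calculations. The sketch below is illustrative only: the contig lengths are toy values rather than the real assembly, and the depth calculation assumes coverage was computed against the 909 Mb flow-cytometry estimate reported later in the Results.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def fold_coverage(yield_bp, genome_size_bp):
    """Sequencing depth: total bases sequenced divided by genome size."""
    return yield_bp / genome_size_bp


# Toy contig lengths (hypothetical, not taken from the assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Combined HiFi yield of the two SMRT cells against an assumed 909 Mb genome.
depth = fold_coverage(36.3e9 + 36.5e9, 909e6)
print(f"toy N50 = {toy_n50}, combined depth = {depth:.1f}X")
```

Under that assumption, the combined yield of 72.8 Gb corresponds to roughly 80X, in line with the reported coverage.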

The primary assembly was scaffolded using SALSA219 with Hi-C proximity ligation libraries. D-Genies20 was used to assess alignment of the scaffolded O. australiensis pseudomolecules with the Oryza sativa reference genome. There were 14 major (> 0.5 Mb) O. australiensis scaffolds that aligned with the 12 O. sativa chromosomes. Ten of these scaffolds each aligned wholly with a single O. sativa chromosome and possessed telomeric repeats on both ends. RagTag21 joined three scaffolds corresponding to O. sativa chromosome 5, resulting in a telomere-to-telomere pseudochromosome 5 of length 79.2 Mb. RagTag did not join the two scaffolds corresponding to chromosome 9.

Eleven of the 12 scaffolds in the finalised assembly possessed telomeres at both ends, representing complete ‘telomere-to-telomere’ (T2T) pseudochromosomes. Pseudochromosome lengths ranged from 49.4 Mb (Oaus_09) to 99.3 Mb (Oaus_02) (Supplementary table 1). The final length of this assembly was 909 Mb with a BUSCO completeness of 99.3% (Table 1). The genome sizes estimated by flow cytometry for O. sativa ssp. japonica cv. Nipponbare and O. australiensis were 391 Mb at 1C (± 1.69% CV) and 909 Mb at 1C (± 0.80% CV), respectively.

Table 1 Primary and haplotype genome assembly and annotation statistics.

Resolution of O. australiensis haplotype assemblies

Assembly of the HiFi reads using the Hifiasm assembler18 generated a pair of haplotype-resolved assemblies, designated haplotype 1 (hap1) and haplotype 2 (hap2). The hap1 contig assembly had a total of 1,254 contigs, an N50 of 72.4 Mb and a total length of 962 Mb. The hap2 assembly produced 678 contigs, an N50 of 69.6 Mb and a total length of 939 Mb. Scaffolding with SALSA2 and RagTag joined two and four contigs in the hap1 and hap2 assemblies, respectively. There were two major contigs in each of the haplotype assemblies corresponding with chromosome 5 in the collapsed assembly; however, these were not joined during scaffolding (Supplementary Fig. 1). Scaffolds from the hap1 assembly that corresponded to pseudochromosomes in the primary genome were designated OausHap1_01 to OausHap1_12. Remaining scaffolds that had telomeric regions, or that appeared to align with the primary assembly but were not assigned as chromosomes, were also included in the finalised hap1 assembly, which consisted of 16 total scaffolds (Table 1). The same procedure was applied to the hap2 assembly, resulting in the selection of 18 final scaffolds. The BUSCO analyses for the hap1 and hap2 assemblies reported a completeness of 99.0% and 99.2%, respectively. There were 10 T2T pseudochromosomes in the finalised hap1 assembly and nine T2T pseudochromosomes in the hap2 assembly (Supplementary table 1). The two haplotype assemblies were highly syntenic with each other (Fig. 1). Around 873.6 Mb (99.9%) of the hap1 chromosome-level assembly was predicted to consist of regions syntenic with the hap2 assembly. Unaligned regions accounted for less than 0.04% (396 kb) of the haplotype assembly lengths, whilst inversions and translocations corresponded to ~ 57 kb and ~ 33 kb of the total assembly lengths, respectively. Heterozygosity of the O. australiensis genome was predicted to be ~ 0.04% using k-mer-based statistical approaches.

Fig. 1

Structural variation between the primary O. australiensis assembly (collapsed) and haplotypes (Hap1 & Hap2). Black vertical lines (|) indicate predicted telomeres across each assembly. Red circles (o) indicate predicted centromeres in the collapsed assembly.

Structural annotation and repeat element content analyses of O. australiensis

Approximately 689 Mb (75.9%) of the genome was designated as interspersed repeat content, of which 43.7% were LTR retrotransposons. The majority of repeat element content was designated into two classes of LTR retrotransposon—Ty1/Copia (10.6%) and Gypsy/DIRS1 (32.3%) retroelements. Around a quarter (24.3%) of transposable elements were unclassified. Further analysis with EDTA22 designated these as unknown LTR retrotransposons. The final LTR retrotransposon content estimated by EDTA was 58.2%. A further 11.3% of repeat content was designated as terminal inverted repeat (TIR) DNA transposons and 4.4% was classed as non-TIR rolling-circle (Helitron) transposons. Analysis of LTR elements using LTR_retriever23,24 demonstrated a high LTR Assembly Index (LAI) score of 21.51, indicating high assembly contiguity.

The primary O. australiensis assembly was annotated using viridiplantae protein data and RNA-seq data as input files. There were 23,929 putative genes and 26,529 transcripts predicted by Braker325. Of these 26,529 predicted transcripts, 23,388 (88%) were supported by the RNA-seq data. Analyses of the annotated protein sequences identified 98.2% of complete BUSCOs within the viridiplantae lineage. Annotation of the haplotype assemblies predicted 23,578 potential coding genes and 26,125 transcripts in hap1 and 23,653 potential genes and 26,182 transcripts in hap2 (Table 1). The gene densities and distributions of transposable elements across the O. australiensis chromosomes were visualised using shinyCircos V2.026 (Fig. 2). Gene density appeared to be highest towards the ends of chromosomes and seemed to coincide with DNA transposon distribution but was inversely distributed with retroelements.

Fig. 2

Gene density and repeat element content in O. australiensis. (A) chromosome number, (B) Gene density, (C) Total retroelement density, (D) Gypsy retrotransposon density, (E) Copia retrotransposon density, (F) terminal inverted repeat (TIR) DNA transposon density, (G) Helitron density.

Intra- and inter-species structural variations in O. australiensis

Extensive structural variation was observed between the primary assembly and a previously published O. australiensis genome12 (Fig. 3). There were 4,607 translocations, 1,084 duplications, and 163 inversion events between the two assemblies, corresponding to 98.8 Mb or 10.9% of the primary O. australiensis genome. Almost half of this structural variation was attributed to large inversions (> 1 Mb and > 1.25% total chromosome length), spanning a cumulative length of 47.7 Mb (5.2% total genome length) (Supplementary table 2). The largest inversion was 6.3 Mb (8.0% of total chromosome length) and found on chromosome 7, followed by an inversion of 5.6 Mb (9.0% of total chromosome length) on chromosome 11. The largest translocated region in the primary assembly was 584 kb in length and located on chromosome 12 in both assemblies. Chromosome lengths were relatively consistent between the current and previously published assemblies. The difference in chromosome lengths was < 2 Mb in nine of the twelve assembled chromosomes. The largest difference in length occurred on chromosome 11 (5.9 Mb length difference).
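The "large inversion" definition used here combines an absolute and a relative size threshold. A minimal sketch of that filter is shown below; the `Inversion` class, the chromosome names, and all coordinates are hypothetical and serve only to illustrate the two thresholds stated in the text (> 1 Mb and > 1.25% of chromosome length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inversion:
    chrom: str
    start: int  # 1-based start coordinate on the chromosome
    end: int    # inclusive end coordinate

    @property
    def length(self):
        return self.end - self.start + 1


def is_large(inv, chrom_length, min_bp=1_000_000, min_fraction=0.0125):
    """True if the inversion exceeds both the absolute and relative thresholds."""
    return inv.length > min_bp and inv.length / chrom_length > min_fraction


def large_inversions(inversions, chrom_lengths):
    """Keep only inversions classed as large on their own chromosome."""
    return [inv for inv in inversions if is_large(inv, chrom_lengths[inv.chrom])]


# Hypothetical chromosome lengths and inversion calls.
chrom_lengths = {"chrA": 80_000_000, "chrB": 100_000_000}
calls = [
    Inversion("chrA", 1, 6_300_000),            # 6.3 Mb, ~7.9% of chrA: large
    Inversion("chrB", 1, 1_200_000),            # 1.2 Mb but only 1.2% of chrB
    Inversion("chrA", 10_000_000, 10_499_999),  # 0.5 Mb: below 1 Mb
]
kept = large_inversions(calls, chrom_lengths)
total_large_bp = sum(inv.length for inv in kept)
```

Requiring both thresholds means a 1.2 Mb inversion on a long chromosome can be excluded even though it passes the 1 Mb cut-off.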

Fig. 3

Structural variation between O. australiensis assembly (O_aust) and previously published genome12 (O_aust_Long2024). Arrows indicate large inversions (> 1.25% of chromosome length).

Structural variation was additionally compared between the O. australiensis chromosomes and a cultivated O. sativa ssp. japonica rice genome (Fig. 4). Despite being considerably evolutionarily diverged, the two genome assemblies demonstrated a high level of structural synteny. Around 526.9 Mb (58.0% total genome length) of the O. australiensis genome was predicted to be syntenic with cultivated rice. Syntenic regions between the two genomes were concentrated on the distal end of chromosomes, whilst large sections of unaligned or inverted regions tended to be distributed centrically. There were 159 inversions and 354 translocations between the primary O. australiensis genome and O. sativa. Thirteen inversions were larger than 1 Mb and greater than 1.25% of the chromosome length, comprising a total of 48.1 Mb (5.3%) of the O. australiensis genome. Nine of these inversions were common (position of inversion within 200 bp and < 3 kb inversion size difference in O. sativa) in both the current and previously published O. australiensis genomes, whilst the remaining four were unique to the current assembly (Tables 2 and 3). The largest inversion between O. australiensis and cultivated rice was 20.1 Mb (22.5% total chromosome length) on chromosome 6.
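The criterion for calling an inversion "common" to both O. australiensis assemblies can be expressed as a simple pairwise comparison. The sketch below reflects one reading of that criterion, namely that start positions in O. sativa coordinates differ by at most 200 bp and inversion sizes differ by less than 3 kb. The tuple layout and all coordinates are hypothetical.

```python
def is_common(inv_a, inv_b, max_pos_diff=200, max_size_diff=3_000):
    """Compare two inversions given as (chrom, start, end) tuples in
    O. sativa coordinates, one from each O. australiensis assembly."""
    chrom_a, start_a, end_a = inv_a
    chrom_b, start_b, end_b = inv_b
    if chrom_a != chrom_b:
        return False
    size_a = end_a - start_a
    size_b = end_b - start_b
    return (abs(start_a - start_b) <= max_pos_diff
            and abs(size_a - size_b) < max_size_diff)


def split_common_unique(current, published):
    """Partition inversions of the current assembly into those with a match
    in the published assembly (common) and those without (unique)."""
    common = [a for a in current if any(is_common(a, b) for b in published)]
    unique = [a for a in current if a not in common]
    return common, unique


# Hypothetical inversion calls against O. sativa for the two assemblies.
current_calls = [("chr6", 5_000_000, 25_100_000), ("chr1", 1_000_000, 2_500_000)]
published_calls = [("chr6", 5_000_150, 25_101_000), ("chr1", 1_400_000, 2_900_000)]
common, unique = split_common_unique(current_calls, published_calls)
```

In this toy input the chromosome 6 calls differ by 150 bp in position and 850 bp in size and are therefore treated as the same inversion, whereas the chromosome 1 calls are 400 kb apart and are treated as distinct.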

Fig. 4

Structural variations between cultivated rice (O. sativa) (O_sat_Shang2023) and O. australiensis. Black arrows indicate large inversions (> 1.25% of chromosome length) that are common to both the current (O_aust) and previously published12 (O_aust_Long2024) assemblies. Red asterisks indicate large inversions unique to that assembly.

Table 2 Common inversions with O. sativa found in O. australiensis assemblies.
Table 3 Unique inversions with O. sativa found in O. australiensis assemblies.

Functional annotation of collapsed O. australiensis genome assembly

The primary O. australiensis coding sequences were functionally annotated. Of the 26,529 transcript sequences, 3,388 had BLAST hits only, 280 were additionally mapped to Gene Ontology (GO) terms but not annotated, and 22,495 received full GO annotation (Supplementary data 1). The remaining 366 (1.4%) sequences did not yield any BLAST hits and were not GO annotated; these were designated as ‘no-BLAST’ sequences. The coding potentials of these no-BLAST sequences were assessed using Arabidopsis (0984) and Oryza (45,270) as reference models. There were 314 (85.8%) and 331 (90.4%) O. australiensis no-BLAST sequences predicted to have coding potential when analysed against the Arabidopsis and Oryza references, respectively.

There were 11,496 GO terms assigned to genes within the O. australiensis genome. GO terms associated with abiotic and biotic stress response were investigated in OmicsBox (OmicsBox—Bioinformatics Made Easy, BioBam Bioinformatics, March 3, 2019, https://www.biobam.com). There were 1,597 coding sequences that possessed GO annotations associated with abiotic stresses or factors. A large number of genes possessed GO functions associated with salt stress response (449), followed by response to cold (282). There were 1,665 coding sequences with GO annotations specifically associated with biotic interactions. The most common GO annotation function related to biotic stress was ‘defense response to bacteria’ (296) followed by ‘defense response to fungus’ (280).

Oryza australiensis possesses species-specific genes and protein clusters

Species-specific protein clusters in O. australiensis were identified using OrthoVenn327, which grouped the 23,929 O. australiensis protein sequences from the primary assembly into 18,241 homologous clusters. Of these, 18,070 (99.1%) clusters were orthologous with O. sativa and 171 were unique to O. australiensis. There were 87 unique, O. australiensis-specific clusters that contained three or more protein sequences. These were annotated and assessed for common GO annotation terms using OrthoVenn3 (Supplementary table 3). Forty-six of these 87 clusters remained unannotated and were not designated a GO function. The largest cluster, with 41 proteins, was annotated with the GO term ‘DNA recombination’ and matched to the Swiss-Prot hit ‘retrovirus-related Pol polyprotein from transposon’ (Q94HW2). Of the remaining nine largest protein clusters, two (both containing 12 proteins) were annotated with the GO term ‘DNA integration’ and similarly matched to the Swiss-Prot hit ‘retrovirus-related Pol polyprotein from transposon’ (P10978, P04323). One cluster, containing 19 proteins, was annotated with the GO term ‘nuclease activity’ (Swiss-Prot hit: Q9M2U3—Protein ALP1-like), and the remaining six largest clusters remained unannotated.

O. sativa ssp. japonica cv. Nipponbare Illumina reads were mapped to the O. australiensis predicted gene set to identify potential species-specific genes. There were 1,431 (6.0%) O. australiensis genes that could not be mapped with O. sativa reads. These were designated as ‘unmapped’, species-specific genes. Of these genes, 968 (67.7%) were GO-annotated in OmicsBox, 374 (26.1%) and 46 (3.2%) received BLAST hits or GO mapping, respectively, but no GO annotation, and 43 (3%) remained without BLAST hits, GO mapping, or annotation. These 43 genes underwent coding potential analyses, of which 34 (79.1%) and 41 (95.3%) genes were classified as coding genes when compared against the Arabidopsis and Oryza models, respectively. There were 102 unmapped gene sequences found in O. australiensis that possessed functions related to abiotic and biotic stress response. From these sequences, the most frequently occurring GO term was ‘response to other organism’ (GO:0051707), which was found in 60 of the unmapped coding sequences, followed by ‘defense response to other organism’ (43 genes) (GO:0098542). Gene enrichment analysis of the unmapped gene subset was conducted using a two-tailed Fisher’s Exact Test. There were twelve GO functions identified to be overrepresented in the unmapped O. australiensis genes (Table 4). Enriched GO functions related to biological processes included DNA integration (GO:0015074) (p < .0001), nucleosome assembly (GO:0006334) (p < .0001), calcium-mediated signalling (GO:0019722) (p = .0003), response to auxin (GO:0009733) (p = .0004), and cellular response to potassium ion starvation (GO:0051365) (p = .0006). Molecular functions that were overrepresented were related to protein heterodimerization activity (GO:0046982) (p < .0001), molecular tag activity (GO:0141047) (p = .0009), protein tag activity (GO:0031386) (p = .0009), and structural constituent of chromatin (GO:0030527) (p < .0001). There were also three enriched GO terms related to cellular components: nucleosome (GO:0000786) (p < .0001), protein-DNA complex (GO:0032993) (p < .0001), and extracellular regions (GO:0005576) (p < .0001).
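For readers unfamiliar with the enrichment test, a two-tailed Fisher's exact test on a 2 × 2 table can be sketched as follows. This is a generic standard-library implementation, not the OmicsBox procedure, and it omits any multiple-testing correction. The per-term counts in the second example are hypothetical; only the gene totals (1,431 unmapped genes out of 23,929) come from the text.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: test set / reference set. Columns: with GO term / without GO term.
    The p-value sums the hypergeometric probabilities of all tables with the
    same margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Small textbook-style table used as a sanity check.
p_small = fisher_exact_two_tailed(3, 1, 1, 3)

# Hypothetical GO term: present in 30 of 1,431 unmapped genes and in
# 100 of the remaining 22,498 genes (counts invented for illustration).
p_term = fisher_exact_two_tailed(30, 1431 - 30, 100, 22498 - 100)
```

With these invented counts the term would be expected in about eight unmapped genes by chance, so observing 30 gives a very small p-value.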

Table 4 Overrepresented GO functions in the unmapped, species-specific O. australiensis gene set.

The O. australiensis genome possesses homologues to cloned disease resistance genes

Two previously published datasets, one containing experimentally-verified plant nucleotide-binding leucine-rich repeat (NLR) disease resistance genes28 (RefPlantNLR dataset) and one containing rice-specific cloned disease resistance genes29 (CDRH dataset) were used for identifying disease resistance homologues in O. australiensis. Seventy-nine (79) genes from the O. australiensis genome shared homology with 52 of the 481 protein sequences in the RefPlantNLR database, which were clustered into a total of 45 shared orthologous groups (Supplementary data 2). Within the 52 RefPlantNLR resistance genes identified to have homologues in O. australiensis, there were two (4%) leaf rust (Lr) resistance genes (Lr13—g4425, Lr21—g22628), three (6%) stripe rust (Yr) resistance genes (Yr-10—g22403, YrAS2388R—g19116, YrU1-RGA4—g13087), and one bacterial blight (Xo1) resistance gene (g11308).

In comparisons against the rice-specific disease resistance gene set, there were 41 O. australiensis genes identified as homologues to 50 of the CDRH sequences, resulting in 39 orthologous clusters between the two datasets (Table 5).

Table 5 Cloned disease resistance gene homologues (CDRHs) identified in O. australiensis.

Of these 41 genes, 11 were homologues to multiple cloned disease resistance genes, four were homologues to only bph resistance genes, 14 were homologues to Pi resistance genes, four were homologues to virus resistance genes, and eight were homologues to Xa resistance genes. One gene, g18778, located on O. australiensis chromosome 8, was clustered into an orthologue group with five cloned disease resistance genes (Pi56, Pi5, Pish, Pi37, BPH14). There were 10 genes in O. australiensis that were distributed on different chromosomes to their homologous counterpart in O. sativa (g18778, g5963, g5428, g18763, g22620, g20366, g22575, g23592, g267, g23146) (Fig. 5). O. australiensis chromosome 11 possessed the highest number of CDRHs (15 genes), followed by chromosome 4 (6 genes), chromosome 12 (5 genes), and chromosome 8 (3 genes). Gene homologues were predominantly concentrated around the distal chromosomal regions.

Fig. 5

Distribution of CDRHs in O. australiensis (O_aust) and corresponding position in O. sativa (O_sat_Shang2023). & = one gene in O. australiensis is homologous to more than one cloned disease resistance gene. # = gene positions from left to right: xa13, BPH18&Pib, BPH14&Pi37&Pish&Pi5&Pi56. ## = gene positions from left to right: Xa46, Xa10&Xa23, Pik&Xa47, Pi54, Pb1, Pi36&RYMV3, Pit&Pb3, Pb2, Xa4, Pi65. + = gene positions from left to right: Xa21, Xa46, Xa10, Xa23, Pb1. ++ = gene positions from left to right: Pi54, RYMV3, Pb3, Pb2, Pik, Xa4, Pi65, Xa3.

Prediction of disease resistance gene candidates in O. australiensis

The O. australiensis unmasked genome assembly and coding sequence files were processed through the NLR-Annotator gene prediction tool to identify potential NLR gene candidates (Table 6). Within the O. australiensis genome assembly there were 271 regions predicted to contain NLR motifs. In the O. sativa genome assembly, 517 sequences with NLR motifs were identified. In the coding sequence files, 132 and 464 potential NLR genes were predicted in O. australiensis and O. sativa, respectively. Chromosome 11 in O. australiensis possessed the greatest number of potential NLR loci (33 gene candidates) whilst chromosome 2 possessed the fewest (five candidates).

Table 6 Summaries of number of disease resistance gene candidates identified in O. australiensis by NLR-Annotator and RGAugury.

Potential resistance gene analogues (RGAs) were identified from annotated O. australiensis amino acid sequences using RGAugury (Table 6). There were 161 non-redundant NBS-encoding disease resistance gene candidates. Of these, 61 candidates possessed coiled-coil (CC), NBS, and LRR domains together (CNL-type), 56 possessed only NBS and LRR domains (NL-type), 18 possessed a CC and NBS domain, but no LRR (CN-type), and 19 consisted of solely an NBS domain (NBS-type). One NBS-LRR gene possessed a Resistance to Powdery Mildew 8 (RPW8) domain at the N-terminus (RNL-type). There were no genes of the TNL type (containing Toll/Interleukin-1 receptor (TIR), NBS, and LRR domains) identified in O. australiensis; however, there were three genes containing a TIR and NBS domain (TN-type) and three containing a TIR and unknown domain (TX-type). In O. sativa, RGAugury predicted 556 NBS-encoding disease resistance gene candidates, of which 447 contained both NBS and LRR domains. There were 647 non-NBS-encoding disease resistance gene candidates identified in O. australiensis, of which 499 were receptor-like kinases (RLKs), 39 were receptor-like proteins (RLPs), 107 possessed transmembrane (TM)-CC domains, and two possessed RPW8-like domains (Table 6). In O. sativa, RGAugury identified 1042 RLKs, 119 RLPs, 190 TM-CCs, and one RPW8-like gene candidate.
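The domain-based classes named above can be summarised as a small decision rule. The sketch below is a simplification written for illustration; it is not RGAugury's actual logic, and the function name and domain labels are assumptions. The tally at the end uses the counts reported in the text.

```python
def classify_nbs_candidate(domains):
    """Assign an NBS/TIR candidate to a class from its set of domain labels.

    Expected labels (assumed for this sketch): "CC", "NBS", "LRR", "TIR", "RPW8".
    Returns None when the candidate has neither a TIR nor an NBS domain.
    """
    d = set(domains)
    has_nbs, has_lrr = "NBS" in d, "LRR" in d
    if "TIR" in d:
        if has_nbs and has_lrr:
            return "TNL"
        return "TN" if has_nbs else "TX"
    if not has_nbs:
        return None
    if "RPW8" in d and has_lrr:
        return "RNL"
    if "CC" in d:
        return "CNL" if has_lrr else "CN"
    return "NL" if has_lrr else "NBS"


# Counts reported for O. australiensis in the text.
reported = {"CNL": 61, "NL": 56, "CN": 18, "NBS": 19, "RNL": 1, "TN": 3, "TX": 3}
total_candidates = sum(reported.values())
# Candidates carrying both NBS and LRR domains.
nbs_lrr_total = reported["CNL"] + reported["NL"] + reported["RNL"]
```

The reported class counts sum to the 161 non-redundant candidates, and the 118 candidates with both NBS and LRR domains correspond to the lower bound of the NLR range quoted in the Discussion.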

Duplicated genes were identified within the O. australiensis genome using MCScanX. Across the entire genome, 1,870 genes were classified as tandem duplicated genes (TDGs), 671 as proximal duplications, and 7,453 as dispersed duplications. Of the RGA candidates identified by RGAugury, 131, 83, and 337 genes were predicted to be tandem, proximal, and dispersed duplicates, respectively. The 131 RGAs classified as TDGs formed 53 clusters of two or more genes. The most commonly occurring TDG RGA class was RLKs (80 TDGs in 31 clusters), followed by NBS-encoding (39 TDGs in 16 clusters), RLPs (6 TDGs in 3 clusters), TM-CCs (4 TDGs in 2 clusters), and RPW8 (2 TDGs in 1 cluster) (Supplementary table 4). Tandem duplications were typically distributed similarly to CDRHs, along the distal ends of chromosomes (Supplementary Fig. 2). Chromosomes 7 and 11 had the highest number of clusters (seven each); however, the clusters on chromosome 7 were larger (25 TDGs) than those on chromosome 11 (15 TDGs). Chromosomes 3 and 10 had the smallest number of TDGs (2 TDGs, 1 cluster).

Discussion

The wild relatives of crops are potential sources of novel genes that can improve agronomic traits in cultivated species. Oryza australiensis possesses characteristics that make it a promising source of abiotic and biotic stress resistance for rice improvement. O. australiensis has the largest genome within the diploid Oryza group owing to a rapid expansion of LTR retrotransposable elements, which hindered the assembly and exploration of its genome until recent years12,15,16,17. This paper reports a high-quality de novo assembly of a near-complete, chromosome-level genome for O. australiensis. Genome sequencing coverage was 80X, resulting in a highly contiguous assembly with a contig N50 of 72.6 Mb, which is higher than that of previously published assemblies12,15,17. Eleven of the scaffolds from this assembly possessed telomeric repeats on both the forward and reverse ends of the scaffold, representing complete, ‘telomere-to-telomere’ pseudochromosomes. Ten of these O. australiensis pseudochromosomes were complete at the contig level after initial assembly with the Hifiasm assembler. Overall, eleven of the twelve O. australiensis pseudochromosomes were complete post-scaffolding, all 24 telomeric ends of the pseudochromosomes were identified, and the final BUSCO score of the collapsed assembly was 99.3%. This represents a remarkably high-quality assembly of the O. australiensis genome. The final size of the O. australiensis genome assembly was 909 Mb, which matched the genome size estimated by flow cytometry. Previous length estimates of the O. australiensis genome have ranged from 858 Mb15 to 1056 Mb30.

In addition to a collapsed genome assembly, two phased de novo haplotype assemblies were also assembled to chromosome-level. The haplotype assemblies exhibited similar BUSCO completeness, number of genes, repeat element content, and genome length to the collapsed assembly, with the exception of chromosome 5, where scaffolding tools joined two contigs to create a longer chromosome in the collapsed assembly, but not in the haplotype assemblies. Analyses of the assemblies revealed high structural synteny and low heterozygosity (~ 0.04%) between the haplotypes, suggesting genetic homogeneity within O. australiensis populations. Interestingly, comparisons against a previously published O. australiensis genome12 identified a large number of inversions and translocations between the previous and current assemblies, suggesting a high level of genetic diversity across populations. These findings corroborate previous suggestions that genetic distance between O. australiensis populations is linked to geographic distance15. Extensive structural variation has also been reported between accessions of wild relatives of other crops, such as barley and cucumber31,32,33,34,35. These variations, and specifically chromosomal inversions, are suggested to facilitate adaptive trait selection to abiotic and biotic stresses32,36,37. Inversions between plant populations additionally play a role in suppressing recombination, resulting in the preservation of favourable allele combinations38. It is therefore possible that O. australiensis, a widely distributed species found across a large geographic region in Australia, possesses extensive intra-species variation to enable and maintain local adaptation of populations to heterogeneous environments. Additional work is needed to confirm that these variations are indeed biological and not a result of assembly artifacts. However, the high assembly quality and contiguity in both the current and previously published genomes strongly support the suggestion that these differences are real. As previously discussed in other studies, these findings also demonstrate the difficulty of capturing the extent of diversity within wild crop relatives using single reference genomes32,35.

Despite an estimated 7 million years of evolutionary divergence, the O. australiensis assembly showed remarkable synteny with the O. sativa genome5,15. Several large inversions (> 1.25% of total chromosome length) were observed to comprise 7.3% of the O. australiensis genome. The largest inversion between O. australiensis and O. sativa was found on chromosome 6 and comprised 22.5% (20.1 Mb) of the total chromosome length in O. australiensis. This inversion was common to both the current and previously published O. australiensis genomes. Similar large inversions occurring in the centric region of chromosome 6 have also previously been reported in other wild Oryza species12,39.

Analysis of the O. australiensis repeat content revealed that approximately 76% of the genome consists of repeat elements, of which around 58% are LTR retrotransposons. In other studies, the O. australiensis total repeat and LTR retrotransposon content have similarly been estimated to be around 74–76% and ~ 60%, respectively12,15,16,40. The O. australiensis repeat element content exceeds that of the other wild Australian rice species, which are estimated to be 42% for O. rufipogon, 27% for O. meridionalis, and 65% for O. officinalis2,40. The most commonly occurring transposable elements belonged to the Gypsy (32.3%) and Copia (10.6%) retrotransposon families. Previous studies suggest that these repeat elements are likely to be primarily comprised of the Kangourou and Wallabi (Gypsy-type) and RIRE1 (Copia-type) LTR retrotransposons16,41.

There were 23,929 genes predicted in the O. australiensis genome, supported by a BUSCO completeness of 98.2%. Whilst this high BUSCO score supports our gene predictions, this gene number was lower than previous estimates for O. australiensis12,15, and was also lower than those for O. sativa cultivated varieties, which range between 35,500 and 38,500 genes2. However, our findings support previous suggestions that the large size of the O. australiensis genome is attributed to the expansion of retroelements, as opposed to gene duplications15,16. Moreover, these findings suggest that the expansion of transposable elements and movement of retrotransposons within or near genes may have contributed to gene disruption, resulting in a decrease in functional genes within O. australiensis42. This is plausible given that transposable element activity is often deleterious and can induce strong loss-of-function mutations43,44.

Mapping of O. sativa data to the O. australiensis genome identified 1,431 coding sequences and 171 protein clusters that were specific to O. australiensis and not found in cultivated rice. Of the 1,431 species-specific transcripts, 43 did not yield any BLAST or GO annotation results. Coding potential analyses showed that between 79% and 95% of these transcripts possessed coding potential, indicating that these could encode for novel genes that have not yet been identified in cultivated rice or other species. Genes and protein clusters specific to O. australiensis were found to be enriched in functions associated with DNA integration and retrotransposon activity, which coincides with the high transposable element content observed in this species16. Species-specific O. australiensis genes were also enriched in functions related to calcium-mediated signalling, which plays an important role in both abiotic and immune response signalling45,46, as well as auxin response, which plays a role in plant growth but has also been linked to abiotic salt stress response in Arabidopsis47.

Homologues to genes related to rice blast, bacterial blight, and BPH resistance were identified in the O. australiensis genome. Similar to other Oryza species2, a large proportion of the O. australiensis disease resistance genes were positionally clustered, with the highest number of genes observed on the distal end of chromosome 11. In cultivated rice, major clusters of leaf blast (Pi) resistance genes have been identified on chromosomes 6, 11, and 12, whilst chromosomes 3 and 7 possess the fewest Pi resistance genes48,49. Similarly in O. australiensis, several homologues to known Pi genes were located on chromosomes 6, 12, and the distal ends of chromosome 11, and no Pi-like genes were identified on chromosomes 3 and 7. In cultivated rice, the majority of brown planthopper (bph) resistance genes are located on chromosomes 3, 4, 6, and 1250. In O. australiensis, bph homologues were found on all of these chromosomes. Bacterial blight (Xa) resistance homologues clustered in similar regions to the Pi genes on chromosome 11, consistent with observations made in cultivated rice51.

Between 118 and 132 nucleotide-binding leucine-rich repeat (NLR) disease resistance gene candidates were identified within the O. australiensis genome, comparable to a previous estimate of 159 NLRs12. No NLR genes of the TIR-NBS-LRR (TNL) type were found in O. australiensis, supporting previous suggestions that the TNL gene type was lost during evolution of the monocot lineage52,53,54,55,56. Cultivated rice O. sativa possesses a higher number of NLRs, with previous estimates ranging between 374 and 535 genes, depending on cultivar and sub-species2,55,57,58,59. Similarly, we predicted between 447 and 464 NLR genes within the O. sativa genome. Our results support previous suggestions that wild rices possess fewer NLR genes compared to cultivated rices as a result of artificial selection for pathogen resistance during domestication2,12. Despite having a lower total number of NLR genes, O. australiensis possessed homologues to almost all of the currently cloned genes conferring resistance to the most globally important rice diseases. Allelic variations within known resistance loci can be sources of novel or additional disease resistance; for example, several variants of the Pi9 blast resistance locus have been utilised across a number of rice cultivars29,60,61,62. Moreover, in O. australiensis, a major QTL, Pi40, was identified at the Pi9 locus and introgressed into cultivated rice to confer broad-spectrum resistance to rice blast in Korea and Türkiye11,63. Consequently, the homologues identified in O. australiensis represent promising candidates for disease resistance improvement in cultivated rice.

Conclusion

In this study we explored the genome of O. australiensis, a crop wild relative of rice. Despite 7 million years of evolutionary divergence and a markedly larger genome, O. australiensis remains highly syntenic with cultivated rice, which has likely facilitated previous introgressions of agronomically valuable genes from the species11,64. There is substantial structural variation between O. australiensis accessions, suggesting local environmental adaptation among populations, yet very low heterozygosity between haplotypes. This study provides new insights into the genetic diversity within this species, and supports the potential of O. australiensis as a candidate source of novel abiotic and biotic stress resistance genes.

Materials and methods

Sample collection, DNA and RNA extractions, and sequencing

Oryza australiensis germplasm was sourced from the Australian Grains Genebank (AusTCRF#300134, seed pack #593). Young fresh leaves from an Oryza australiensis plant (ID: Oaus_01) were collected from a glasshouse at The University of Queensland (UQ) Gatton campus. Total genomic DNA was extracted from pulverized leaf tissues using a CTAB (cetyltrimethylammonium bromide) DNA extraction protocol65 with the following amendments. Experimental steps involving phenol were replaced with 100% chloroform. Extracted DNA was dissolved and stored in 10 mM Tris-HCl buffer. RNase A (Qiagen, 19101) was added to a final concentration of 4 ng/µL after the DNA had been extracted and dissolved. DNA quality was assessed using spectrophotometry and gel electrophoresis as described in the above protocol65. High-fidelity (HiFi) reads were generated at the Australian Genome Research Facility (AGRF), UQ, using Pacific Biosciences (PacBio) circular consensus sequencing methods with two PacBio Sequel II SMRT cells. For the RNA-seq data, RNA was extracted from O. australiensis leaf tissue and RNA libraries were prepared by the Ramaciotti Centre for Genomics using the TruSeq Stranded Total RNA with Ribo-Zero plant kit. Sequencing was conducted at the Ramaciotti Centre for Genomics, University of New South Wales, Australia, using a NextSeq 500 system and 300-cycle MID output kit. For Hi-C sequencing, fresh young leaves were collected from an O. australiensis plant grown in glasshouse conditions at the Arizona Genomics Institute, the University of Arizona, USA. Hi-C library preparation and sequencing were performed by Dovetail Genomics, California, USA. The Dovetail Hi-C libraries were quality-checked by sequencing approximately 1–2 million 75 bp paired-end reads on a MiSeq instrument and mapping the data back to the de novo assembly.

Genome length estimation by flow cytometry

Mature Oryza sativa ssp. japonica cv. Nipponbare, Oryza australiensis, and Macadamia tetraphylla were used for cytometric analyses. Mechanical dissociation was used to prepare nuclear suspensions as previously described66, with modifications for woody plant species. Briefly, young leaves were freshly harvested and then chopped in 0.5 mL of ice-cold Arumuganathan and Earle67 nuclear isolation buffer in a 5 cm polystyrene Petri dish. To estimate nuclear DNA content for rice, 40 mg of the internal standard M. tetraphylla was co-chopped with 15 mg of rice for approximately 7–8 min. Resulting homogenates were gently filtered through a pre-soaked 40-µm nylon mesh into a 5 mL round-bottom polystyrene tube. Homogenates were then stained with 50 µg/mL of propidium iodide (PI) (Sigma, P4864-10ML) and 50 µg/mL of RNase A (Qiagen, 19101) for 10 min on ice.

The BD Biosciences LSR II Flow Cytometer and FlowJo software package were used to analyse the homogenates. Fluorescence was collected using a 488 nm excitation laser tuned to 514.4 nm and a 610/20 nm bandpass filter. Instrument settings were kept constant across experiments: forward scatter voltage at 200, side scatter voltage at 350, fluorescence intensity voltage at 500, with a slow flow rate (20–50 events/s). Three biological replicates were performed on three different days. For each biological replicate, a minimum of 1,500 PI-stained events were collected per PI-stained peak. The %CV for the PI-stained peaks ranged between 2.2 and 4.3, with a mean of 3.0. Nuclear DNA content was calculated as previously described68 using 796 Mb at 1C for the assumed size of M. tetraphylla.
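As an illustration of the nuclear DNA content calculation, the following is a minimal sketch that assumes the conventional ratio method, in which the sample size is the sample-to-standard fluorescence ratio multiplied by the known size of the internal standard. The function name and the peak values are hypothetical and are not measurements from this study; the exact procedure is the one given in the cited reference68.

```python
def estimate_1c_size_mb(sample_peak_mean, standard_peak_mean, standard_1c_mb=796.0):
    """Estimate the 1C genome size (Mb) of a sample from PI fluorescence peaks.

    Assumption: size scales linearly with the ratio of the sample peak mean
    to the internal-standard peak mean. The 796 Mb default is the assumed
    1C size of M. tetraphylla stated in the text.
    """
    if sample_peak_mean <= 0 or standard_peak_mean <= 0:
        raise ValueError("peak means must be positive")
    return sample_peak_mean / standard_peak_mean * standard_1c_mb


# Hypothetical (sample, standard) peak means for three biological replicates.
replicates = [(120.0, 100.0), (118.0, 100.0), (122.0, 100.0)]
estimates = [estimate_1c_size_mb(s, r) for s, r in replicates]
mean_estimate = sum(estimates) / len(estimates)
```

Averaging the per-replicate estimates, rather than pooling the raw peak values, keeps each day's instrument drift confined to its own sample-to-standard ratio.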

Genome assembly

Hi-C sequencing data from Dovetail Genomics, California, USA, was integrated with PacBio HiFi reads using the Hifiasm Denovo assembler v0.19.4 (r575)18 to create two de novo contig haplotype assemblies and a collapsed, primary assembly. Genome assembly completeness was assessed against the 425 Benchmarking Universal Single-Copy Orthologues (BUSCO v5.4.6) within the viridiplantae lineage, using the Galaxy AUGUSTUS program workflow69,70. Genome contiguity was assessed with the quality assessment tool for genome assemblies (QUAST v5.2.0)71. Contigs were mapped with Hi-C proximity ligation libraries using Chromap v0.2.572 and scaffolded using the SALSA2 scaffolding tool19. Additional scaffolds were joined with the RagTag scaffolding tool v2.1.021, using Oryza sativa (Osativa323v7 Phytozome) as a reference (https://phytozome-next.jgi.doe.gov/). Resulting scaffolds were assigned to corresponding O. sativa chromosomes. Lengths of pseudochromosomes were estimated using the seqtk toolkit73. Telomeric repeat regions were characterised in pseudochromosomes using the Telomere Identification Toolkit (tidk) (https://github.com/tolkit/telomeric-identifier) and the plant telomeric repeat ‘AAACCCT’. Centromeric regions were predicted using CentroMiner within quarTeT74. K-mer analysis was conducted on the O. australiensis genome using Jellyfish (v2.3.1)75 and GenomeScope2.076 using a k-mer size of 21.
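To make the telomere characterisation step concrete, the sketch below counts the plant telomeric repeat and its reverse complement in the terminal windows of a pseudochromosome. It is a simplified illustration only and does not reproduce how tidk works internally; the function names, the window size, and the toy sequence are assumptions introduced here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def count_repeat(seq, motif):
    """Count non-overlapping occurrences of motif in seq."""
    return seq.upper().count(motif)


def terminal_repeat_counts(chromosome, motif="AAACCCT", window=10_000):
    """Count motif plus reverse-complement hits in the first and last window.

    Returns a (start_count, end_count) tuple. The window size is an
    illustrative choice, not a parameter taken from the study.
    """
    rc = reverse_complement(motif)
    start = chromosome[:window]
    end = chromosome[-window:]
    return (
        count_repeat(start, motif) + count_repeat(start, rc),
        count_repeat(end, motif) + count_repeat(end, rc),
    )


# Toy sequence: telomeric repeats at both ends, arbitrary filler in between.
toy = "AAACCCT" * 5 + "ACGT" * 10 + "AGGGTTT" * 3
counts = terminal_repeat_counts(toy, window=50)
```

High counts at both ends of a pseudochromosome are the pattern expected for a telomere-to-telomere scaffold, whereas a count near zero at one end suggests that the terminal sequence is missing or misplaced.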

Structural annotation

Structural annotation was performed on the finalised O. australiensis collapsed and haplotype assemblies. Transposable element repeats within the O. australiensis genome were identified and masked using RepeatModeler2 v2.0.577 and RepeatMasker v4.1.578. Repeat element content within the O. australiensis genome was assessed with RepeatMasker v4.1.5 using the soft masking option. Paired-end RNA-seq files were adaptor-trimmed using TrimGalore! v0.6.1079 and aligned to genomes using HISAT280. Gene prediction was performed with Braker3 v3.0.325 using RNA-seq data and viridiplantae protein databases as support inputs. Subread81 and featureCounts82 were used to assess the proportion of genes with RNA-seq support. BUSCO analysis using the viridiplantae lineage was used to predict genome completeness of the predicted protein sequences. The LTR Assembly Index (LAI) score for assessing assembly contiguity was calculated using LTR_retriever23,24.

Density distributions of predicted genes and repeat element content were visualised with shinyCircos V2.026. The repeat element data used for density visualisations were generated using EDTA v2.2.0 (parameters: -annoc 1)22.

Structural synteny and genome visualisation

O. australiensis chromosomes were compared with those from O. sativa (NCBI: PRJNA95366383 and Osativa323v7 from Phytozome), as well as those from a previously published O. australiensis assembly12 using D-Genies v1.520 and Syri84 with Minimap2 v2.2885 to investigate intra- and inter-species structural synteny. Inversions between assemblies were designated as ‘large’ if inversion length exceeded 1.25% of the total chromosome length34.
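The size criterion for 'large' inversions can be expressed as a short check. This is a minimal sketch, assuming 1-based inclusive inversion coordinates; the function name is hypothetical, and the threshold is held as an exact fraction to avoid floating-point ambiguity at the boundary.

```python
from fractions import Fraction

# 1.25% of chromosome length, stored exactly.
LARGE_INVERSION_FRACTION = Fraction(125, 10_000)


def is_large_inversion(inv_start, inv_end, chromosome_length):
    """Return True if the inversion exceeds 1.25% of the chromosome length.

    Assumption: coordinates are 1-based and inclusive, so the inversion
    length is end - start + 1.
    """
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    inv_length = abs(inv_end - inv_start) + 1
    return inv_length > LARGE_INVERSION_FRACTION * chromosome_length
```

For a hypothetical 80 Mb chromosome the cut-off is 1,000,000 bp, so an inversion must span at least 1,000,001 bp to be designated as large.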

Functional annotation

Functional annotation of the O. australiensis CDS annotation file was performed in OmicsBox 3.1.11 (OmicsBox—Bioinformatics Made Easy, BioBam Bioinformatics, March 3, 2019, https://www.biobam.com) using the following workflow. Sequences were searched with the Basic Local Alignment Search Tool (BLAST) against a non-redundant protein sequence database using BLASTX with viridiplantae taxonomy and an e-value threshold of 1.0E−10. Sequences were run through InterProScan86 and Gene Ontology (GO) terms were retrieved from BLAST hits using Gene Ontology mapping with Blast2GO87 annotation. Blast2GO annotations were then combined with GO terms produced from InterProScan. Sequences with no BLAST hits were run through the Coding Potential Analysis Test (CPAT)88 in OmicsBox 3.1.11 to search for potential coding genes. CPAT was run using the prebuilt Arabidopsis thaliana model and an Oryza-specific model (organism: 45270 Oryza).

Prediction of candidate unique coding sequences and protein clusters in O. australiensis

O. sativa ssp. japonica Illumina read data was extracted from the National Center for Biotechnology Information (NCBI) (SRR15967546 & SRR15967547) and mapped against the O. australiensis gene list to identify candidate unique coding sequences. Data was mapped using the Map Reads to Reference tool (length fraction: 0.9 and similarity fraction: 0.9) on the QIAGEN CLC Genomics Workbench 24.0.2 (QIAGEN, Aarhus, Denmark).

O. australiensis genes that were not mapped by any O. sativa Illumina reads were designated as ‘unmapped’ genes and were functionally annotated with OmicsBox 3.1.11. Sequences that had no BLAST hits were then assessed for coding potential using the CPAT analysis described above. OmicsBox was used with FatiGO89 to conduct a two-tailed Fisher’s Exact test for enriched GO functions in the unmapped genes compared to the overall O. australiensis gene set. Adjusted p-values were calculated within the FatiGO pipeline using the Benjamini and Hochberg false discovery rate method89,90. Differentially represented GO terms were identified using a p-value cut-off of 0.05.
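The statistical logic of this enrichment test can be sketched with the standard library alone. The code below is an illustration of a two-tailed Fisher's Exact test followed by Benjamini and Hochberg adjustment; it is not the FatiGO implementation, and the function names and the example counts are hypothetical. Each 2x2 table is assumed to hold, for one GO term, the unmapped genes with and without the term (a, b) and the comparison genes with and without the term (c, d).

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, p)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for two GO terms.
raw = [fisher_exact_two_tailed(1, 9, 11, 3), fisher_exact_two_tailed(3, 0, 0, 3)]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
```

Adjusting across all tested GO terms matters here because thousands of terms are tested at once, and an unadjusted 0.05 cut-off would be expected to flag many terms by chance alone.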

The protein sequence data for both O. australiensis and O. sativa (Osativa323v7 protein file Phytozome) were filtered for the longest isoform and then analysed for orthologous and unique protein clusters within the O. australiensis genome using OrthoVenn3 (parameters: OrthoFinder algorithm, E-value: 1e−2, Inflation value: 1.50)27,91. GO enrichment analyses and annotation for unique O. australiensis clusters were automatically run through the OrthoVenn3 platform.
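The filtering step that keeps a single, longest protein per gene before clustering can be sketched as follows. This is a minimal illustration that assumes the gene-to-transcript relationship is already known (for example, parsed from the annotation file); the function name, record layout, and identifiers are hypothetical.

```python
def longest_isoform_per_gene(records):
    """Keep the longest protein per gene.

    records: iterable of (gene_id, transcript_id, protein_sequence) tuples.
    Returns a dict mapping gene_id to (transcript_id, protein_sequence).
    """
    best = {}
    for gene_id, transcript_id, seq in records:
        current = best.get(gene_id)
        # Keep the first-seen isoform on ties so the result is deterministic.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Hypothetical records: gene g1 has two isoforms, gene g2 has one.
records = [
    ("g1", "g1.t1", "MKT"),
    ("g1", "g1.t2", "MKTAYL"),
    ("g2", "g2.t1", "MA"),
]
filtered = longest_isoform_per_gene(records)
```

Reducing each gene to one representative prevents alternative transcripts of the same gene from being counted as separate members of a protein cluster.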

Identification of cloned disease resistance gene orthologues in O. australiensis

Cloned disease resistance gene sequences in rice compiled in a previous paper29 were searched and downloaded from the Rice Annotation Project (RAP)92 and MSU Rice Genome Annotation Project databases93,94 (Supplementary table 5). This compiled list of genes was designated as the cloned disease resistance homologue (CDRH) list. An additional database, RefPlantNLR, containing 481 amino acid sequences corresponding to experimentally-verified plant NLRs within a number of important plant species was also retrieved28. The RefPlantNLR database and compiled CDRH gene lists were searched for orthologues within the O. australiensis genome using OrthoFinder v2.5.5 with default parameters91.

Identification of potential resistance gene analogues in O. australiensis

The collapsed O. australiensis genome was run through two tools, NLR-Annotator95 and RGAugury55, to identify disease resistance gene candidates. NLR-Annotator annotates potential NLR loci by identifying amino acid motifs associated with NLRs. NLR-Annotator was run against both the O. australiensis unmasked assembly and the annotated CDS sequence file, as well as against the O. sativa (Osativa323v7 protein file Phytozome) unmasked genome assembly and CDS file. The annotated O. australiensis amino acid sequence file was run through RGAugury, which identifies resistance gene analogues (RGAs), including NLRs, receptor-like kinases (RLKs), and receptor-like proteins (RLPs) using the programs BLAST96, nCoils97, pfam_scan98, Phobius99, and InterProScan86.

Identification of tandem duplicated resistance genes

Tandem duplicated genes were identified with MCScanX100 using the O. australiensis longest isoform protein sequence file and corresponding GFF file. The O. australiensis protein sequence file was searched against itself to identify homologous genes using BLASTp96 within DIAMOND (v2.1.7.161)101. The following parameters were specified: --more-sensitive, --max-target-seqs 5, --evalue 1e−45, --query-cover 70. The BLAST output and the O. australiensis GFF file were run through duplicate_gene_classifier within MCScanX100. RGA candidates identified by RGAugury that were classified as duplicated gene pairs were extracted and manually designated as tandem duplicates if they belonged to the same RGA gene family and satisfied one or both of the following criteria: 1) were within 100 kb of each other, or 2) were adjacent with no intervening non-homologous genes.
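The manual designation criteria can be written as a small predicate. The sketch below is one possible reading of those criteria, not the procedure used in the study: 'within 100 kb' is taken as the gap between the end of the upstream gene and the start of the downstream gene, and 'homologous' is approximated as membership of the same RGA family. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chromosome: str
    start: int
    end: int
    order: int   # rank of the gene along its chromosome
    family: str  # RGA family label, e.g. "NLR", "RLK", "RLP"


def gap_bp(g1, g2):
    """Distance between the end of the upstream gene and the start of the other."""
    first, second = sorted((g1, g2), key=lambda g: g.start)
    return max(0, second.start - first.end)


def is_tandem_pair(g1, g2, chromosome_genes, max_gap_bp=100_000):
    """Apply the two criteria to a pair already classed as duplicated."""
    if g1.chromosome != g2.chromosome or g1.family != g2.family:
        return False
    within_distance = gap_bp(g1, g2) <= max_gap_bp
    low, high = sorted((g1.order, g2.order))
    between = [
        g for g in chromosome_genes
        if g.chromosome == g1.chromosome and low < g.order < high
    ]
    # Assumption: an intervening gene counts as homologous if it shares the family.
    no_unrelated_between = all(g.family == g1.family for g in between)
    return within_distance or no_unrelated_between


# Hypothetical genes on one chromosome.
a = Gene("a", "chr11", 1_000, 5_000, 0, "NLR")
b = Gene("b", "chr11", 400_000, 405_000, 1, "NLR")
c = Gene("c", "chr11", 410_000, 412_000, 2, "RLK")
d = Gene("d", "chr11", 420_000, 425_000, 3, "NLR")
e = Gene("e", "chr11", 900_000, 905_000, 4, "NLR")
genes = [a, b, c, d, e]
```

In this toy layout, a and b qualify through adjacency alone, b and d qualify through distance despite the intervening RLK gene, and b and e fail both criteria.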