Main

The genetic bottleneck imposed by thousands of years of domestication has inevitably impoverished the rice crop genetic diversity needed to adapt to the changing environment1,2,3. New solutions are needed to overcome current and future challenges in rice production and sustainability, in light of a predicted expansion of the human population to 10 billion by 2050 (ref. 4). To help reduce this bottleneck, we are exploring and exploiting the standing genetic diversity of the genus Oryza. The genus includes Asian and African cultivated rice and 25 wild species (that is, 15 diploid genomes with 2n = 2x = 24 chromosomes and ten allotetraploid genomes with 2n = 4x = 48 chromosomes) spanning 11 extant genome types (AA, BB, CC, BBCC, CCDD, EE, FF, GG, HHJJ, HHKK and KKLL) defined on the basis of cytogenetic5 and/or molecular and phylogenetic evidence6. The Oryza species collectively encompass ~15 million years (Myr) of evolutionary history6,7,8 and represent a crucial resource for tolerance and resistance traits that could be harnessed for crop improvement and/or serve as the raw material for neodomestication9,10.

Since the late 1990s, genetic and genomic characterizations of the genus Oryza have advanced from single gene trait discovery and cloning11,12,13 to restriction fragment length polymorphism14 and physical mapping15, the first reference genome sequences of rice16,17,18 and a set of chromosome-level reference genomes representing the 15 distinct subpopulations of Asian rice19. The release of a chromosome-level reference genome of Oryza alta and an optimized protocol for neodomestication of this species through editing of key domestication genes has led to a new era in improvement of polyploid cereal crops20. Next steps include the generation of a complete digital GeneBank for cultivated rice as well as a set of ultra-high-quality reference genomes of the wild relatives of rice, including all Oryza tetraploid species21.

Here, we report the generation and interrogation of 11 chromosome-level reference genomes, from nine underutilized wild tetraploid Oryza species22,23 (Oryza malampuzhaensis (BBCC), Oryza minuta (BBCC), O. alta (CCDD), Oryza grandiglumis (CCDD), Oryza latifolia (CCDD), Oryza coarctata (KKLL), Oryza schlechteri (KKLL[HHKK]), Oryza longiglumis (HHJJ) and Oryza ridleyi (HHJJ)) and two wild diploid species (Oryza australiensis (EE) and Oryza meyeriana (GG)), using PacBio long-read sequencing technology and Bionano optical validation mapping. This dataset, combined with ten previously published diploid reference genomes, was used to describe how genome size and composition have evolved across the species in the genus, showing that some species are more malleable than others. The role of transposable elements (TEs) in shaping genome size was particularly evident in the ridleyi complex (O. ridleyi and O. longiglumis, both HHJJ), in which the differential expansion of a collection of TEs produced a striking size variation of the homoeologous subgenomes. We revisited a previous phylogenetic analysis of the Oryza genus24 by adding new evidence of the relationships among subgenomes. Based on our molecular and phylogenetic evidence, which demonstrates high similarity between the O. schlechteri and O. coarctata genomes, we propose changing the genome type designation of O. schlechteri from HHKK to KKLL. We performed synteny analysis at both macro and micro scales to define major chromosomal rearrangements and gene presence/absence variation in the wild species with respect to the AA genomes, explored the extent of gene fractionation25 in the subgenomes after polyploidization, and investigated the phenomena of subgenome dominance and equivalence in O. coarctata.

The release of 11 new chromosome-level reference genomes, in combination with previously published genome assemblies, represents a comprehensive and unique opportunity for future research in the fields of evolutionary biology, functional genomics, population genetics and biodiversity conservation of Oryza and could provide a robust instrument for neodomestication of climate-adapted rice crops26,27,28.

Results

The wild Oryza genome data package

We generated chromosome-level reference assemblies for nine tetraploid and two diploid wild Oryza species (Extended Data Table 1 and Supplementary Note 1) and extensively assessed them for quality and contiguity (with Bionano optical maps, genome assembly evaluation using BUSCO29 and the genome mapping rate) as described in Supplementary Note 2 and Supplementary Tables 1–3. These assemblies were annotated for genes and TEs using a uniform annotation pipeline (Extended Data Table 2, Supplementary Note 2 and Supplementary Tables 4–7). BioSample, BioProject, and Sequence Read Archive (SRA) accession numbers are provided in Supplementary Data 1, and code is available at https://github.com/alicefornasiero/IOMAP-3/tree/main.

TE dynamics in the ridleyi complex

Genome size across the Oryza genus varied ~3.4-fold, with the polyploids in the ridleyi complex being the largest (that is, the HHJJ genome species O. longiglumis (1,147 Mb) and O. ridleyi (1,203 Mb)) (Fig. 1 and Extended Data Table 2). Subgenome size was strikingly different in these species, with the HH subgenome showing ~1.5-fold variation with respect to the JJ subgenome (Supplementary Table 5). Analysis of TEs showed that size variation in the subgenomes of O. longiglumis and O. ridleyi could be attributed to a difference in TE abundance (Fig. 2). The ratios of TE content over non-TE content in HH subgenomes were 1.87 and 1.90 for O. longiglumis and O. ridleyi, respectively, whereas the same ratios for the JJ subgenomes were 1.21 and 1.25 for O. longiglumis and O. ridleyi, respectively (Supplementary Table 5). To investigate the preferential expansion of TEs in the HH subgenomes, we examined the distribution of TEs belonging to six Oryza-specific superfamilies (CACTA, Ty1/Copia, Ty3/Gypsy, MuDR, hAT and LINE) in each subgenome and generated neighbor-joining trees. This analysis did not reveal any evidence of preferential expansion of a TE superfamily over others (Supplementary Figs. 1 and 2). In these species, the majority of long terminal repeat retrotransposons (LTR-RTs) were amplified after polyploidization, estimated ~2.25 million years ago (Ma)24: 83.4% and 83.7% in the HH and JJ subgenomes of O. ridleyi, respectively; and 71.9% and 76.1% in the HH and JJ subgenomes of O. longiglumis, respectively (Supplementary Fig. 3). We then determined whether the variation in TE content in the subgenomes was due to a differential rate of either TE accumulation or TE removal in one of the two subgenomes.
Unequal recombination and illegitimate recombination serve as mechanisms for LTR-RT elimination, the former generating solo LTRs by recombining LTRs within or between different LTR-RTs, and the latter acting on dissimilar DNA sequences to remove sections of TE sequences and occasionally leaving incomplete elements30,31. To assess TE removal efficacy in the subgenomes of O. ridleyi and O. longiglumis, we calculated the ratio of solo LTRs to complete LTR-RTs; we found no relevant difference between the HH and JJ subgenomes. In O. ridleyi, the ratios of solo LTRs to complete LTR-RTs in the HH and JJ subgenomes were 1.2 and 1.5, respectively. In O. longiglumis, the ratios were 1.1 and 1.2 in the HH and JJ subgenomes, respectively. These values were similar to those found for Oryza sativa31,32. This evidence confirms previous findings of El Baidouri and Panaud, who showed that the ratio of solo LTRs to complete LTR-RTs does not depend on genome type32. In summary, our results indicate that the difference in subgenome size in the HHJJ genome species is primarily due to preferential accumulation of LTR-RT-related sequences in the HH subgenome. The distribution of the main Oryza-specific TE families showed no preferential expansion of specific families, and the solo LTR to complete LTR-RT ratios showed no evidence of differential efficiency in TE removal, thereby favoring TE accumulation as the primary mechanism contributing to subgenome size disparity.
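
The two ratios used in this section can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, and the example inputs are placeholder values chosen only to show the arithmetic. Real inputs would come from the TE annotation of each subgenome.

```python
def te_to_non_te_ratio(te_mb: float, subgenome_mb: float) -> float:
    """Ratio of TE-derived sequence (Mb) to non-TE sequence (Mb)."""
    non_te_mb = subgenome_mb - te_mb
    if te_mb < 0 or non_te_mb <= 0:
        raise ValueError("TE content must be >= 0 and smaller than the subgenome size")
    return te_mb / non_te_mb


def solo_to_complete_ratio(n_solo_ltr: int, n_complete_ltr_rt: int) -> float:
    """Ratio of solo LTRs to complete LTR-RTs, a proxy for removal by unequal recombination."""
    if n_complete_ltr_rt <= 0:
        raise ValueError("at least one complete LTR-RT is required")
    return n_solo_ltr / n_complete_ltr_rt


if __name__ == "__main__":
    # Hypothetical placeholder values, not data from the study.
    print(te_to_non_te_ratio(te_mb=650.0, subgenome_mb=1000.0))
    print(solo_to_complete_ratio(n_solo_ltr=1200, n_complete_ltr_rt=1000))
```

Under this reading, similar solo LTR to complete LTR-RT ratios in two subgenomes that differ in TE to non-TE ratio point toward differential accumulation rather than differential removal.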

Fig. 1: Overview of the syntenic landscape and large-scale structural rearrangements of 12 Oryza species (21 (sub)genomes) with the outgroup L. perrieri.

Riparian plot showing macro-syntenic regions and large-scale structural rearrangements (large duplications and translocations) across the chromosomes of 12 Oryza species (21 (sub)genomes) and outgroup species L. perrieri. Genome types are shown according to the phylogenetic order in the genus, from the top (O. sativa (AA)) to the bottom (O. meyeriana (GG)). Each chromosome is colored as follows: Chr1, orange; Chr2, beige; Chr3, celeste; Chr4, steel blue; Chr5, navy blue; Chr6, deep purple; Chr7, plum; Chr8, magenta; Chr9, raspberry; Chr10, ruby; Chr11, coral; Chr12, salmon. Chromosomes are scaled by assembly length.

Fig. 2: (Sub)genome size and TE content of the wild Oryza species.

a, Correlation between (sub)genome size (Mb) and TE content (Mb) in the nine tetraploid and two diploid wild Oryza species. The significance of the linear correlation (Pearson’s correlation coefficient, R²) was ascertained by two-sided t-test. b, Abundance of the main classes of TEs (Mb). DNA transposons are shown as follows: hAT (DTA), pink; CACTA (DTC), red; Harbinger (DTH), orange; Mutator (DTM), yellow; Mariner (DTT), ochre. LTR retrotransposons are shown as: LTR Copia, dark blue; LTR Gypsy, light blue; LTR unknown, steel blue. Unspecified TEs are shown in white; non-TE content is shown in gray. (Sub)genomes of the species are ordered by genome type (BB, CC, DD, EE, KK, LL, HH, JJ, GG).

Macro-synteny and large-scale chromosomal rearrangements

To understand and visualize the syntenic relationships across the genus, we built a syntenic map of the entire Oryza genus (Fig. 1) that included 21 species (nine tetraploid and two diploid species from this study, plus ten publicly available diploid species, including the International Rice Genome Sequencing Project (IRGSP) reference sequence) and the outgroup species Leersia perrieri (Supplementary Table 8). The synteny map (shown as a riparian plot) tracks the syntenic homologous blocks across the 11 genome types (AA, BB, CC, DD, EE, KK, LL, HH, JJ, FF and GG), showing ~15 Myr of evolution in inversions, duplications and translocations across the genus (Fig. 1). The riparian plot in Supplementary Fig. 4 shows collinear syntenic blocks inverted in consecutive (sub)genome pairs (shown as blue ribbons). The small-scale segmental duplication33 on chromosomes 11 and 12 was shared by all Oryza species and L. perrieri. The O. alta (CCDD) and O. grandiglumis (CCDD) genomes shared five unbalanced translocations relative to the O. sativa genome (for example, t(Chr1CC; Chr3), t(Chr6CC; Chr1), t(Chr7CC; Chr4); t(Chr3DD; Chr6); t(Chr4DD; Chr7)) (Supplementary Fig. 5), adding molecular support to the hypothesis of their conspecific nature based on previous morphological, cytogenetic and distribution data5. None of these translocations was found in CCDD species O. latifolia or in any other species included in this study (Fig. 1). Large chromosomal rearrangements are often associated with heterochromatic regions composed of repetitive DNA, such as satellites, minisatellites and simple sequence repeats34,35,36. When comparing the O. alta and O. grandiglumis genomes with an O. sativa reference genome, we found that nine of ten putative translocation breakpoints showed the presence of nearby simple sequence repeats, that is, AT repeats (Supplementary Figs. 6 and 7), suggesting that these sequences may have facilitated the occurrence of these chromosomal rearrangements.
Reciprocal translocations between homoeologous chromosomes in polyploid genomes can be found by aligning subgenomes with each diploid relative genome species. When aligning the BBCC genome species with their diploid relative genome species (Oryza punctata (BB) and Oryza officinalis (CC)), we found a reciprocal translocation between Chr1BB and Chr1CC (that is, t(Chr1BB; Chr1CC)), as confirmed with optical maps, in both O. minuta (~9 Mb translocation size) and O. malampuzhaensis (~8 Mb translocation size) (Supplementary Fig. 8).

To provide further support for the structural description of the Oryza genus provided above, we generated genome-type-level pangenomes for the AA, BB, CC and DD genome types (‘Data availability’). For each pangenome, we calculated the core pangenome size, which ranged from 180 Mbp to 270 Mbp (Extended Data Fig. 1a), providing strong support for our estimates obtained by calculating the non-TE-related content in individual genomes (Supplementary Note 2). Regarding large chromosomal rearrangements, the pangenomes for the CC and DD genome types showed (as expected) evidence of the large translocations identified by the macro-synteny analysis, further supporting the presence of these rearrangements (Extended Data Fig. 1b,c).

The syntenic pangenome

To identify core gene sets conserved during Oryza evolution, as well as accessory gene sets that underwent duplication, translocation and/or gene loss, we performed a micro-synteny analysis at the (sub)genome level (Fig. 3). A total of 832,658 gene sequences identified from the gene prediction of 30 Oryza (sub)genomes (Supplementary Table 8) and the outgroup L. perrieri were grouped into 77,482 syntenic gene clusters (groupings of syntenic homologous genes) and used to build a syntenic pangenome (Fig. 3a). For O. alta and O. grandiglumis, owing to chromosomal duplications and unbalanced translocations (as described above), underlying genes were also duplicated and translocated, replacing genes on the chromosomal portions that were lost. In Fig. 3a, the yellow (duplicated genes) and gray (depleted genes) tracks belonging to the same clusters of the dendrogram in either subgenome of O. alta and O. grandiglumis correspond to these genes (Supplementary Table 9). Congruent with the random occurrence of translocations, we could not detect overrepresented gene ontology (GO)-slim terms when comparing O. sativa homologs of either O. alta or O. grandiglumis genes duplicated in the CC (DD) subgenome and depleted in the DD (CC) subgenome with O. sativa homologs of genes in the CC (DD) subgenome.

Fig. 3: The syntenic pangenome.

a, Phylogenomic profiling of clusters of syntenic genes across 30 Oryza (sub)genomes (the nine tetraploid and two diploid species presented here and ten additional diploid species listed in Supplementary Table 8) and outgroup species L. perrieri. In the heat map, each row represents a (sub)genome, and each column shows a syntenic cluster (that is, a grouping of syntenic homologous genes across two or more Oryza species and/or between Oryza and L. perrieri). The minimum number of genes in a cluster is two homologous genes in two different species or genera. Clustering of the Oryza (sub)genomes was based on presence and absence patterns of syntenic clusters using Euclidean distance and is shown as a dendrogram on top of the figure. For each syntenic cluster, gene copy-number variation is represented as follows: gene absence, light gray; one gene copy, blue; two gene copies, yellow; three or more gene copies, red. On the left side, vertical bars represent the genome type of the (sub)genomes: AA, dark blue; BB, red; CC, dark green; DD, purple; EE, black; KK, sky blue; LL, orange; HH, light green; JJ, lilac; FF, light gray; GG, light blue; O (outgroup), peach. b, Histogram showing the frequency distribution of syntenic clusters in the 30 (sub)genomes shown in a and shared by increasing numbers of Oryza (sub)genomes (x axis). The legend shows the percentage of core (found in all 30 subgenomes), softcore (found in 27–29 subgenomes), dispensable (found in 2–26 subgenomes) and private syntenic clusters. c, Percentages of genes classified in the different syntenic cluster categories (core (blue), softcore (green) and dispensable (yellow)) in the 30 Oryza (sub)genomes. Percentages of genes classified in private syntenic clusters are not shown.

Syntenic core, softcore, dispensable and private clusters were defined as those present in all 30 (sub)genomes, in 27–29 (sub)genomes (≥90%), in 2–26 (sub)genomes and in one (sub)genome, respectively (Fig. 3b,c). We found 6,256 (8.1%), 9,865 (12.7%), 61,130 (78.9%) and 231 (0.3%) syntenic core, softcore, dispensable and private clusters, respectively (Fig. 3b). Functional enrichment analysis of GO-annotated genes showed that the most significantly overrepresented and underrepresented GO terms referred to expected functions for different portions of the syntenic pangenome (that is, the core portion contained genes with ‘essential’ functions, shared by all species, whereas the softcore and dispensable portions contained genes with ‘malleable’ functions that were partially shared) (Supplementary Table 10).
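
The category definitions above can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: the function name `classify_cluster` is hypothetical, and the softcore cut-off is written as "at least 90% but not all (sub)genomes", which for 30 (sub)genomes corresponds to the 27–29 range stated in the text. The cluster counts are those reported in Fig. 3b.

```python
N_SUBGENOMES = 30  # Oryza (sub)genomes in the syntenic pangenome


def classify_cluster(n_present: int, n_total: int = N_SUBGENOMES) -> str:
    """Assign a syntenic cluster to a category from the number of (sub)genomes carrying it."""
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    if n_present == n_total:
        return "core"
    if 10 * n_present >= 9 * n_total:  # >= 90%, integer arithmetic avoids rounding issues
        return "softcore"
    if n_present >= 2:
        return "dispensable"
    return "private"


# Cluster counts reported in the text (Fig. 3b).
REPORTED_COUNTS = {"core": 6256, "softcore": 9865, "dispensable": 61130, "private": 231}


def percentages(counts: dict) -> dict:
    """Percentage of clusters in each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for n in (30, 27, 26, 2, 1):
        print(n, classify_cluster(n))
    print(percentages(REPORTED_COUNTS))
```

The four reported counts sum to the 77,482 syntenic clusters, which is a quick consistency check on the category boundaries.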

Reconstruction of the evolutionary history of Oryza

To gain insight into the evolutionary history and maternal origins of the genus Oryza, we first reconstructed a chloroplast-genome-based phylogenetic tree using the chloroplast sequences of 26 Oryza species (the ten chloroplast genomes assembled in this study and 16 chloroplast genomes obtained from NCBI) and the outgroup Leersia japonica (Supplementary Table 11). The chloroplast-genome-based phylogenetic tree showed that after divergence from Leersia, Oryza split into two main clades: a basal clade that includes the GG, FF and HHJJ genome types and a core clade consisting of all other genome types (Fig. 4a). Our tree was highly consistent with previously reported chloroplast-genome-based trees37,38. Of note, the former study did not include O. coarctata and O. schlechteri genomes, whereas the latter merged the KKLL and HHKK genome types into a single HHKK type. Analogous to the tree presented by Zhang and colleagues38, our chloroplast-based tree showed that O. longiglumis and O. ridleyi (both HHJJ genome types) and O. schlechteri (KKLL[HHKK] genome type) do not form a monophyletic group, supporting different maternal donors (Fig. 4a).

Fig. 4: Phylogenetic analysis of the Oryza (sub)genomes.

a, Phylogenetic tree based on chloroplast genome sequences. IQ-TREE was used to reconstruct a maximum likelihood phylogeny using the large-single-copy regions of the chloroplast genomes of 26 Oryza species (the chloroplast genomes of the ten species presented here and 16 additional chloroplast genomes of diploid species; Supplementary Table 11). Supporting values next to each branch are SH-aLRT (Shimodaira–Hasegawa-like approximate likelihood ratio) support (%)/ultrafast bootstrap support (%). 100%/100% support values are not shown. Branch length indicates substitutions per site. Trees were rooted using L. japonica as outgroup. b, Time-dated phylogenetic tree based on nuclear gene sequences. The phylogeny was inferred using the maximum likelihood method with a concatenated alignment of 528 single-copy genes. Phylogenetic dating was obtained using the molecular calibration for the crown age of Oryza (14.5 Ma) and the divergence of CC and AA-BB (6 Ma)39,43. c, Ks value distribution plot for HH, JJ, KK and LL genome types (O. ridleyi JJ versus O. longiglumis JJ, closed purple circle; O. ridleyi HH versus O. longiglumis HH, closed green square; O. schlechteri LL[HH] versus O. coarctata LL, closed orange triangle; O. schlechteri KK versus O. coarctata KK, open blue diamond; O. schlechteri LL[HH] versus O. longiglumis HH, green closed diamond; O. schlechteri LL[HH] versus O. ridleyi HH, open green square). The genome types used in the phylogenetic trees refer to definitions based on cytogenetic and hybridization experiments5 and the molecular evidence provided here for the renaming of O. schlechteri from HHKK to KKLL genome type.

To better understand the evolutionary relationships of 21 of the 27 Oryza species for which a chromosome-level assembly was available, we performed coalescent phylogenetic analyses using 3,728 single-copy syntenic genes present in all 30 Oryza (sub)genomes and the outgroup. This gene set enabled grouping of the Oryza (sub)genomes into six highly supported monophyletic groups: AA, BB, CC, DD/EE, KK and LL/HH/JJ/FF/GG (Supplementary Fig. 9). The diploid CC genome (O. officinalis) clustered with the CC subgenomes of BBCC species O. minuta and O. malampuzhaensis but not with the CC subgenomes of CCDD species O. alta, O. grandiglumis and O. latifolia. This was consistent with previous studies39,40 suggesting that a CC genome species (probably O. officinalis (CC)) is the paternal donor of the BBCC tetraploid species, and the maternal donor is a BB genome species (probably O. punctata (BB); Fig. 4a); whereas the CC genome in the CCDD tetraploid species served as the maternal parent and might be different from O. officinalis (Supplementary Fig. 9).

Coalescent phylogenetic analysis of the species in the unclassified group41 (O. coarctata (KKLL) and O. schlechteri (KKLL[HHKK])) showed that the LL genome of O. coarctata (KKLL) clustered with the LL[HH] genome of O. schlechteri (KKLL[HHKK]), and the latter did not form a monophyletic group with the HH genomes of the HHJJ species (O. ridleyi and O. longiglumis) (Supplementary Fig. 9). This incongruity was investigated in greater detail at both the phylogenetic and molecular levels to explore the relationships of the HH, JJ, KK and LL genome types (Fig. 4b,c). The results of this analysis, described in Supplementary Note 3 and Supplementary Figs. 10–13, led us to conclude that the ‘HH’ subgenome of O. schlechteri should be renamed ‘LL’ (Fig. 5).

Fig. 5: Consensus tree summarizing origins and evolutionary history of diploid and tetraploid Oryza species.

Single-line branches denote diploid species, whereas double-line branches denote tetraploid species. Single dashed lines represent unknown diploid wild relative species. Forward slash (/) indicates that the species names—O. meyeriana and O. granulata—are considered synonyms (https://powo.science.kew.org). One asterisk (*) denotes that the maternal donor is a BB genome species. Two asterisks (**) denote that the maternal donor is a CC genome species. Genome types and known representative species are shown next to the terminal nodes. The relative times of hybridization events are based on the current study. The tree includes the new designation of O. schlechteri as KKLL genome type (the same as O. coarctata) proposed in this work. L. perrieri and L. japonica (here collectively referred to as Leersia) are the outgroups.

To date the Oryza phylogeny, we used the least-squares dating method42 on a concatenated alignment of 528 single-copy genes and obtained a phylogeny that was highly consistent with that obtained with the coalescent approach (Fig. 4b). Using the molecular calibration of 14.5 Ma for the crown age of Oryza and 6 Ma for the divergence of the CC and AA-BB species39,43, we estimated the divergence times of the genome types and inferred the hybridization times of the tetraploid genomes (Fig. 4b and Supplementary Table 12). Although Zou and colleagues suggested in 2015 that two independent polyploidization events represented the origins of O. malampuzhaensis and O. minuta (both BBCC species)39, the grouping of the BB and CC subgenomes of these two species first, followed by grouping with either diploid relative genome (Fig. 4b), suggests a common origin of the BBCC species. Therefore, we considered the average between 1.53 Ma (the estimated divergence time of the BB subgenomes) and 1.92 Ma (the estimated divergence time of the CC subgenomes) to infer a polyploidization time of 1.73 Myr for the BBCC species. The three CCDD species (O. alta, O. grandiglumis and O. latifolia) are assumed to have originated from a single hybridization event44,45. We therefore considered the average between 2.13 Ma (the estimated divergence time of the CC subgenomes) and 2.94 Ma (the estimated divergence time of the DD subgenomes) to infer a polyploidization time of 2.54 Myr for the CCDD species. O. ridleyi and O. longiglumis (both HHJJ species) are closely related and diverged quite recently46. For these species, we considered the average between 2.39 Ma (the estimated divergence time of the HH subgenomes) and 2.09 Ma (the estimated divergence time of the JJ subgenomes) to infer a polyploidization time of 2.24 Myr. The divergence times of the HH, JJ, KK and LL subgenomes suggested that the formation of the tetraploid genomes of O. coarctata (KKLL) and O. schlechteri (KKLL[HHKK]) occurred much earlier than that of the HHJJ genomes (Fig. 4b).
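
The averaging step described above can be reproduced with a short Python sketch. The divergence times are the values quoted in the text; the dictionary layout and the function name `polyploidization_time` are illustrative choices, not part of the authors' code.

```python
# Estimated divergence times (Ma) of the two subgenomes within each tetraploid group,
# as quoted in the text.
SUBGENOME_DIVERGENCE_MA = {
    "BBCC": {"BB": 1.53, "CC": 1.92},
    "CCDD": {"CC": 2.13, "DD": 2.94},
    "HHJJ": {"HH": 2.39, "JJ": 2.09},
}


def polyploidization_time(divergence_ma: dict) -> float:
    """Mean of the subgenome divergence times, used as the inferred polyploidization time."""
    if not divergence_ma:
        raise ValueError("at least one divergence time is required")
    return sum(divergence_ma.values()) / len(divergence_ma)


if __name__ == "__main__":
    for genome_type, times in SUBGENOME_DIVERGENCE_MA.items():
        print(genome_type, polyploidization_time(times))
```

The unrounded means are 1.725, 2.535 and 2.24, which correspond to the 1.73, 2.54 and 2.24 reported for the BBCC, CCDD and HHJJ species.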

According to estimates of LTR-RT insertion times (Supplementary Fig. 3), amplification of these retroelements occurred mostly after the polyploidization event in the allotetraploids, with the majority of LTR-RTs (ranging from 77.7% in the LL subgenome of O. coarctata to 98.7% in the DD subgenome of O. latifolia) having inserted in the past 3 Myr.

Homoeologous gene fractionation

Following whole-genome duplication (via allopolyploidization, the hybridization of two or more distinct species; or autopolyploidization, the multiplication of a complete chromosome set within a species), gene copies can be lost from one homoeologous chromosome or the other(s), resulting in gene fractionation. Over evolutionary time, gene fractionation leads to reduction of a polyploid genome back to a diploid state in which the overall genomic structure has changed substantially47. Gene fractionation (Fig. 6a) was measured as a percentage of homoeologous gene retention in the subgenomes of the tetraploid Oryza genomes. A lower percentage of gene retention with respect to the genome-wide average indicates overfractionation (greater gene loss), whereas a higher percentage indicates underfractionation (greater gene retention). Statistical comparisons (using two-sided Wilcoxon rank-sum tests) of gene retention between subgenomes showed the most pronounced difference within O. longiglumis and O. ridleyi, with higher gene retention in the HH subgenome, followed by the species in the officinalis complex, with higher gene retention in the CC subgenome (P < 0.001; Fig. 6a).
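
As a rough illustration of how gene retention can be summarized along a chromosome, the Python sketch below computes the percentage of retained homoeologs in sliding windows. It is a sketch under stated assumptions: the input is an ordered list of booleans (one per reference gene, True if a syntenic homoeolog is retained in the subgenome), and the window size and step are hypothetical defaults, because the values used in the study are not given in this section. The resulting per-window percentages for two subgenomes are the kind of distributions that a Wilcoxon rank-sum test would compare.

```python
from typing import List, Sequence


def retention_per_window(retained: Sequence[bool], window: int = 100, step: int = 10) -> List[float]:
    """Percentage of retained homoeologs in each sliding window of `window` genes."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    percentages = []
    # Only full windows are reported; a trailing partial window is skipped.
    for start in range(0, len(retained) - window + 1, step):
        chunk = retained[start:start + window]
        percentages.append(100.0 * sum(chunk) / window)
    return percentages


if __name__ == "__main__":
    # Toy example: four retained genes followed by four lost genes.
    toy = [True] * 4 + [False] * 4
    print(retention_per_window(toy, window=4, step=2))
```

Windows whose retention falls below the genome-wide average would be read as overfractionated, and windows above it as underfractionated.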

Fig. 6: Homoeologous gene retention in Oryza and subgenome equivalence in O. coarctata.

a, Distribution of gene retention (percentage, y axis) in the subgenomes of the tetraploid species (x axis). Each genome type is colored as in Figs. 3a and 4b. The red dashed line indicates the average percentage of gene retention calculated genome-wide for each species. P values from two-sided Wilcoxon rank-sum tests and numbers of sliding windows (n) are shown. b, Transcript abundance of homoeologous genes in O. coarctata and their homologs in O. sativa. Gene expression as log2(TPM + 1) was measured in the leaf and in the root considering the replicates together. P values from two-sided Wilcoxon rank-sum tests are shown. In a and b, the 50th percentiles are defined by middle lines; lower and upper hinges correspond to the 25th and 75th percentiles. The upper whisker extends from the hinge to the largest value at most 1.5 times the interquartile range from the hinge. The lower whisker extends from the hinge to the smallest value at most 1.5 times the interquartile range from the hinge. Data beyond the end of the whiskers were considered to be outliers and plotted as individual points. c, Homoeologous gene pair expression bias (B) in the leaf (left) and the root (right) of O. coarctata. Blue and orange bars represent the expression of homoeologs biased toward KK (B < −1) and LL (B > 1) subgenomes, respectively. Homoeolog pairs with −1 ≤ B ≤ 1 (gray bars) are defined as nondominantly expressed. N represents the number of homoeologous gene pairs in the three categories (N_KK, homoeologous gene dominantly expressed in KK subgenome; N_LL, homoeologous gene dominantly expressed in LL subgenome; N_nonbiased, homoeologous gene not dominantly expressed). B_KK, B_LL and B_nonbiased represent average expression bias for the homoeologous pairs in the respective categories. nf, nonfractionated (homoeologous gene pairs); f, fractionated genes; Os-ph, O. sativa genes homologous to nonfractionated O. coarctata genes (paired homologous); Os-sch, O. sativa genes homologous to fractionated O. coarctata genes (single-copy homologous).

We used the total average percentage of gene retention and the inferred polyploidization time (Fig. 6a and Supplementary Table 12) of the BBCC, CCDD and HHJJ species to estimate the average gene loss per million years for these genome types (Supplementary Fig. 14). Gene loss was faster in younger species (the BBCC species that originated ~1.73 Ma) and progressively slowed in older species (the HHJJ and CCDD species that originated ~2.24 Ma and ~2.54 Ma, respectively).
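
One plausible way to express an average rate of gene loss is sketched below in Python. The formula (percentage of homoeologous pairs no longer retained, divided by the time since polyploidization) is an assumption made for illustration and may differ from the calculation behind Supplementary Fig. 14. The polyploidization times are those given in the text, whereas the retention percentage is a hypothetical placeholder.

```python
def gene_loss_per_myr(mean_retention_pct: float, polyploidization_ma: float) -> float:
    """Average percentage of homoeologous pairs lost per million years (assumed formula)."""
    if not 0.0 <= mean_retention_pct <= 100.0:
        raise ValueError("retention must be a percentage between 0 and 100")
    if polyploidization_ma <= 0.0:
        raise ValueError("polyploidization time must be positive")
    return (100.0 - mean_retention_pct) / polyploidization_ma


# Inferred polyploidization times (Ma) quoted in the text.
POLYPLOIDIZATION_MA = {"BBCC": 1.73, "HHJJ": 2.24, "CCDD": 2.54}

if __name__ == "__main__":
    hypothetical_retention_pct = 60.0  # placeholder, not a value from the study
    for genome_type, age in POLYPLOIDIZATION_MA.items():
        print(genome_type, gene_loss_per_myr(hypothetical_retention_pct, age))
```

With an identical retention value the rate necessarily falls as age increases, so a comparison between genome types is informative only when the measured retention of each group is used.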

Subgenome dominance and equivalence in O. coarctata

Subgenome dominance is a widely observed phenomenon in polyploid plant genomes (for example, in Brassica rapa48, monkeyflower49 (Mimulus peregrinus), maize50 (Zea mays), cotton51 (Gossypium) and switchgrass52 (Panicum virgatum)) in which genes from one subgenome tend to be expressed at higher levels than those from the homoeologous subgenome. Over evolutionary time, the less expressed subgenome (that is, the submissive subgenome) tends to lose more homoeolog copies than the more highly expressed subgenome, generating biased fractionation. Alternatively, subgenome equivalence means neither genome is ‘dominant’ over the other, and gene loss occurs at an approximately equal extent in the subgenomes47. Absence of biased fractionation and subgenome dominance has been observed in Capsella bursa-pastoris53, pear54 (Pyrus bretschneideri) and Chinese sprangletop55 (Leptochloa chinensis). Cucurbita ssp. maxima and moschata (pumpkin and gourd) exhibit biased gene-loss patterns in some chromosomes and lack of subgenome dominance56. Similarly, the allotetraploid broomcorn millet (Panicum miliaceum) shows subtle biased gene fractionation and no significant dominance of either subgenome57. To investigate these phenomena in the Oryza tetraploids, we analyzed gene expression patterns in O. coarctata (KKLL), owing to its importance as a halophytic species58 and the availability of transcriptome data. The average gene retention in O. coarctata was 56.1%, meaning that, on average, ~56% of genes were retained in duplicate and were syntenic in the homoeologous chromosomes (Supplementary Fig. 15). Overall, the KK subgenome showed overfractionation (that is, greater gene loss) with respect to the LL subgenome (P < 0.001, two-sided Wilcoxon rank-sum test; Fig. 6a).

To investigate the correlation between gene expression and gene loss, we analyzed the transcript abundance of nonfractionated genes (homoeologous gene pairs) and fractionated genes in O. coarctata and compared it with that of paired homologous and single-copy homologous genes in O. sativa, in leaf and root tissues. The median gene expression in O. coarctata was significantly higher (P < 0.001, two-sided Wilcoxon rank-sum test) for homoeologous gene pairs than for fractionated genes (Fig. 6b and Supplementary Table 13). O. sativa genes homologous to nonfractionated O. coarctata genes (O. sativa paired homologous genes) showed significantly higher expression levels (P < 0.001, two-sided Wilcoxon rank-sum test) than O. sativa genes homologous to fractionated O. coarctata genes (O. sativa single-copy homologous genes) (Fig. 6b and Supplementary Table 13). This evidence suggests that in O. coarctata, genes with two homoeologous copies tend to be expressed at higher levels than genes that have undergone fractionation. Moreover, O. sativa genes homologous to O. coarctata nonfractionated genes tended to be expressed at higher levels than O. sativa genes homologous to O. coarctata fractionated genes. These results suggest that in O. coarctata, genes with low expression tend to be preferentially lost compared with highly expressed genes during the process of rediploidization over evolutionary time. Similar analyses and results have been reported for broomcorn millet57, for which the expression of homoeologous genes was compared with that of homologous genes in the diploid wild relative Panicum hallii.

We then compared expression in homoeologous gene pairs to investigate whether there was subgenome dominance in the tissue-specific transcriptome data of O. coarctata. Among the dominantly expressed homoeologous gene pairs, a higher fraction was dominantly expressed in the KK subgenome in leaf and in the LL subgenome in root (one-sided binomial test, P < 0.001) (Fig. 6c and Supplementary Table 14).

Taken together, our analyses show that despite greater gene retention in the LL subgenome of O. coarctata (suggesting biased gene fractionation), neither subgenome is consistently dominant in expression: the subgenome with the larger share of dominantly expressed homoeologs differs between tissues (KK in leaf, LL in root), a mosaic pattern that suggests subgenome expression equivalence59. Additional tissues need to be analyzed to provide further evidence of subgenome equivalence in O. coarctata.

Discussion

We have generated a comprehensive resource of publicly available wild diploid and tetraploid Oryza reference genomes spanning all tetraploid genome types and the EE and GG diploid genome types, using PacBio long-read sequencing, optical maps and long/short read back-mapping validation. Previous efforts in this direction were made within the framework of the International Oryza Map Alignment Project (IOMAP60).

Analysis of our dataset resulted in insights into genome evolution within the genus. We generated a macro-synteny description of the Oryza (sub)genomes and built and characterized a syntenic pangenome. We characterized large chromosomal rearrangements that resulted in our present-day inventory of living Oryza species. For example, the description of five large nonreciprocal translocations shared between O. alta (CCDD) and O. grandiglumis (CCDD) but not present in O. latifolia (CCDD) added robust evidence to a previous hypothesis of their conspecific nature38,45. Given that O. alta is considered to be a synonym for O. latifolia in the Plants of the World Online (https://powo.science.kew.org), and intermediate forms between O. alta and O. grandiglumis have been observed61,62, genetic studies at the population level are needed to investigate the history and composition of these species and potentially resolve their taxonomic classification63. According to estimation of LTR-RT insertion time in the allotetraploids, TE amplification occurred mostly after polyploidization. The distribution of LTR-RT insertion events in O. coarctata (KKLL) and O. longiglumis (HHJJ) showed a higher fraction of older LTR-RT elements in these two species compared with the other Oryza species. This could be due to a slower mutation rate and/or a less efficient TE removal mechanism in these species. This analysis corroborates previous findings32 showing that LTR-RT insertion time and accumulation are not dependent on genome type. We analyzed an interesting example of size variation between homoeologous genomes in the species of the ridleyi complex and identified TEs as the driving force of genome size change, thereby clarifying the mode of amplification in these species—that is, the involvement of the entire TE complement, as opposed to the rapid explosion of a few TE families as shown previously in O. australiensis64 (EE) and Oryza granulata65 (GG).

We retested the phylogenetic placement of the genomes, confirming previous findings7,37,40,44,66. We also estimated divergence times of the subgenomes and inferred polyploidization timing for the BBCC, CCDD and HHJJ species. Our whole-genome-scale phylogenetic analysis provided robust confirmation of previous inferences (based on the analysis of a few genes) regarding the origins of the HH, KK and JJ genome types, according to which O. schlechteri and O. coarctata are more closely related to each other and share the same genome type44. In addition, genome sequence similarity, TE abundance and distribution, and presence/absence patterns of syntenic genes clearly showed greater similarity at the molecular level between the LL[HH] subgenome of O. schlechteri and the LL subgenome of O. coarctata compared with the other HH subgenomes. Our phylogenetic and molecular evidence is consistent with a previous report on O. coarctata6, the authors of which proposed the renaming of this species (previously designated an HHKK genome type) as a KKLL genome type. Thus, we recommend the renaming of O. schlechteri from an HHKK to a KKLL genome type, as shown in the phylogenetic tree in Fig. 5.

Our investigation of gene fractionation in the tetraploids found variable patterns of gene fractionation among subgenomes, as recently described in other plant systems47. Investigation of subgenome dominance/equivalence in leaf and root tissue of O. coarctata (KKLL) did not reveal evidence of expression dominance of one subgenome over another, even though gene fractionation was higher in subgenome KK. Additional transcriptome data for O. coarctata and the other tetraploid species will be needed to investigate the phenomenon of subgenome dominance/equivalence in Oryza.

This dataset provides a valuable resource for future investigations, which could include the discovery of adaptive genes and/or traits to improve cultivated rice, the neodomestication of the wild Oryza species, and population genetics studies of wild Oryza species across their species ranges for conservation and enhancement of their genetic diversity for the planet’s future63.

Methods

Sample collection

Single seed descent germplasm for O. alta (IRGC 105143), O. australiensis (IRGC 100882), O. grandiglumis (IRGC 105669), O. latifolia (IRGC 100890), O. longiglumis (IRGC 106525), O. malampuzhaensis (IRGC 80765), O. meyeriana (IRGC 106473), O. minuta (IRGC 101141), O. ridleyi (IRGC 100821) and O. schlechteri (IRGC 82047) was obtained from the International Rice Research Institute (IRRI, Philippines) under the Standard Material Transfer Agreement. Seeds were sown in potting soil, and plants were grown in a greenhouse at an air temperature of 24–29 °C with 15–25% humidity. O. coarctata (IRGC 104502) leaf tissue was obtained from a vegetative voucher plant imported from IRRI through the USDA and grown at an air temperature of 24–29 °C with 15–25% humidity in the greenhouse at the University of Arizona.

Construction of genome-type-level pangenomes

Given the large sequence variation among distant Oryza species, pangenomes for the AA, BB, CC and DD genome types were generated separately (that is, genome-type-level pangenomes). The pangenomes were built using PanGenome Graph Builder67, and each chromosome was considered individually. The settings used were as follows: for AA genomes, the similarity threshold was set to 90%, the k-mer length to 7 bp and the segment length to 15,000 bp; for the other genome types, the similarity threshold was lowered to 80%, with the other settings remaining the same. The core pangenome size statistics were calculated by applying the Panacus68 tool to the GFA files (https://github.com/GFA-spec/GFA-spec). The AA pangenome included Oryza barthii, Oryza glaberrima, Oryza glumaepatula, Oryza meridionalis, Oryza nivara, Oryza rufipogon and O. sativa. The BB pangenome included O. punctata and the BB subgenomes of O. malampuzhaensis and O. minuta. The CC pangenome included O. officinalis and the CC subgenomes of O. malampuzhaensis, O. minuta, O. alta, O. grandiglumis and O. latifolia. The DD pangenome included the DD subgenomes of O. alta, O. grandiglumis and O. latifolia.

TE annotation

Extensive De-novo TE Annotator (EDTA69 v.1.9.0) was used to generate a de novo nonredundant TE library for each genome, using default settings. De novo TE libraries were then used to annotate TEs in each respective genome using EDTA and to soft-mask genome sequences using RepeatMasker70 (v.4.1.0) for the subsequent step of gene prediction.

Gene prediction

After evaluating read quality with FastQC (v.0.11.8) and removing adapters with Trimmomatic71 (v.0.38), we aligned RNA sequencing (RNA-seq) reads to their respective genomes using the STAR72 (v.2.7.8a) aligner embedded in the OmicsBox tool (Bioinformatics Made Easy, BioBam Bioinformatics, https://www.biobam.com/omicsbox, v.2.0.36). Minimap2 (ref. 73) aligner implemented in OmicsBox was used to align high-quality transcripts from full-length isoform sequencing (Iso-seq) reads. Gene prediction using extrinsic evidence was carried out on soft-masked genomes using Augustus74 (v.3.4.0) software embedded in OmicsBox with a model training set derived for O. sativa, for all species except O. australiensis. For O. australiensis, MAKER-P75 (v.3.01.03) was used to carry out gene prediction using RNA-seq data from leaf tissue. MAKER-P was run on the soft-masked genome of O. australiensis with the Augustus74 (v.3.2.1), SNAP76 (v.2013-11-29) and Fgenesh+77 gene predictors. Genes and transcripts were retained if the annotation edit distance was less than 1.

For the macro-synteny and phylogenetic analyses, gene models in each genome species were filtered to retain the longest isoform using the AGAT tool suite (https://agat.readthedocs.io/en/latest/index.html). The resulting gene prediction for each genome species was evaluated for completeness using the poales_odb10 database in BUSCO29 (v.5.1.2) with assessment mode ‘proteins’.

Functional gene annotation

Functional gene annotations were generated for the 11-genome dataset presented in this work and ten additional diploid species obtained from NCBI and used to build the syntenic pangenome. Diamond blastp (v.2.1.8) was used to search the protein sequences against the NCBI nonredundant database (https://ftp.ncbi.nlm.nih.gov/blast/db/v5/) using an e-value threshold of \(1\times 10^{-6}\), a minimum sequence identity of 80% and a minimum subject cover of 40% to report an alignment. The best blast hit for each predicted protein sequence was retained. Functional information was then translated into GO terms using the Blast2GO78 (v.6.0) suite. The protein blast hits were mapped onto the Gene Ontology Annotation database (v.2023.08) (GO terms refer to the UniProt ID mapping). GO annotation of the GO term candidates was performed using an annotation score cut-off of 75% to select the GO term for a given GO branch.
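The filtering logic described above can be illustrated with a minimal Python sketch. It assumes that hits have already been parsed into dictionaries; the field names (query, subject, pident, evalue, bitscore, scov) and the use of bit score to rank the best hit are assumptions of this sketch and are not taken from the Diamond output used in the study.

```python
def best_hits(hits, max_evalue=1e-6, min_identity=80.0, min_subject_cover=40.0):
    """Keep, for each query, the highest-scoring hit that passes all thresholds.

    `hits` is an iterable of dicts with the hypothetical keys
    'query', 'subject', 'pident', 'evalue', 'bitscore' and 'scov'.
    """
    best = {}
    for hit in hits:
        passes = (
            hit["evalue"] <= max_evalue
            and hit["pident"] >= min_identity
            and hit["scov"] >= min_subject_cover
        )
        if not passes:
            continue
        current = best.get(hit["query"])
        # Assumption: "best" means the largest bit score among passing hits.
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best


example_hits = [
    {"query": "g1", "subject": "s1", "pident": 92.0, "evalue": 1e-50, "bitscore": 310.0, "scov": 85.0},
    {"query": "g1", "subject": "s2", "pident": 99.0, "evalue": 1e-80, "bitscore": 420.0, "scov": 30.0},
    {"query": "g2", "subject": "s3", "pident": 75.0, "evalue": 1e-20, "bitscore": 150.0, "scov": 90.0},
]
retained = best_hits(example_hits)
```

In this toy input, the second hit for g1 has the higher score but fails the subject-cover threshold, and the only hit for g2 fails the identity threshold, so a single hit (g1 to s1) is retained.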

GO enrichment analysis

O. sativa homologs of either O. alta or O. grandiglumis genes duplicated in the CC(DD) subgenome and depleted in the DD(CC) subgenome were used for GO enrichment analysis in PANTHER (v.18.0) (https://www.pantherdb.org/). The reference gene set consisted of O. sativa homologs of either O. alta or O. grandiglumis genes in the CC(DD) subgenome. PANTHER GO-Slim annotations for each ontology (molecular function, biological process and cellular component) were assigned to test and reference gene sets. Fisher’s exact test was used with a P value threshold of 0.001, and P values were corrected using the false discovery rate. For functional enrichment analysis of the syntenic pangenome, two-tailed Fisher’s exact test with a P value threshold of 0.001 was used on the set of GO-annotated genes in each pangenome category (core, softcore and dispensable) and a reference set containing all GO-annotated genes in the pangenome. P values were corrected using the false discovery rate. The top ten overrepresented and underrepresented GO terms were then selected for each GO ontology.
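For readers unfamiliar with the statistics involved, the following Python sketch shows a two-tailed Fisher’s exact test on a 2 × 2 table and a false discovery rate correction. It assumes the Benjamini-Hochberg procedure for the correction, which is not stated explicitly above, and all function names are hypothetical; PANTHER performs these steps internally and may differ in detail.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with fixed margins)
    that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(1.0, p_value)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a GO term, the table would hold the counts of test-set genes with and without the term in the first row and the corresponding reference-set counts in the second row; a term would be reported when its adjusted P value falls below the chosen threshold.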

Insertion time of LTR-RTs

The insertion time of complete LTR-RTs in the 11 genome species was estimated using the method proposed by SanMiguel and colleagues79, which is based on the nucleotide distance between the two flanking LTR sequences of a complete LTR-RT. For each genome species, LTR sequences flanking complete LTR-RTs were aligned pairwise using the global aligner STRETCHER80 from the EMBOSS tool suite (v.6.6.0). The nucleotide distance between the two LTR sequences was quantified using the Kimura two-parameter method81 as implemented in the DISTMAT software80 from EMBOSS (v.6.6.0). The nucleotide distance (D) was then converted to insertion time (T) in Myr using the formula \(T=\frac{D}{2\mu }\times {10}^{-6}\), where μ is the substitution rate of \(1.3\times 10^{-8}\) per site per year82 and the factor \(10^{-6}\) converts years to Myr. Distributions were represented as percentages of complete LTR-RT insertions in bins of width 0.5 Myr.
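As a rough illustration of this calculation, the Python sketch below computes a Kimura two-parameter distance from a pair of aligned LTR sequences and converts it to an insertion time, taking T = D/(2μ) in years and expressing it in Myr. The function names and the treatment of gaps and ambiguous bases are assumptions of this sketch; the study used STRETCHER and DISTMAT from EMBOSS, whose handling of such sites may differ.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Columns with a gap or a non-ACGT symbol are skipped (sketch assumption).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_myr(distance, mu=1.3e-8):
    """Insertion time in Myr: D / (2 * mu) gives years, divided by 1e6."""
    return distance / (2 * mu) / 1e6
```

With the substitution rate given above, a distance of 0.026 corresponds to an insertion time of about 1 Myr, which would fall in the 1.0–1.5 Myr bin of the distributions.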

TE amplification in the homoeologous subgenomes of O. ridleyi and O. longiglumis

Conserved tracts of 100 amino acids from transposase and reverse transcriptase enzymes were used as queries in tBlastn83 (v.2.14.0) to identify DNA TEs and retroelements, respectively, in the subgenomes of O. ridleyi and O. longiglumis. Five hundred paralogs covering at least 80% of the query length were randomly selected among the tBlastn output hits for each of six TE superfamilies (Ty1/Copia and Ty3/Gypsy LTR retroelements, LINE non-LTR retroelements, and CACTA, MuDR and hAT DNA TEs) and aligned with their respective subgenomes using MUSCLE84 (v.3.8.425). The multiple sequence alignments for each TE superfamily were then used to build a neighbor-joining tree using MegaX85 (v.10.1). Bootstrap values were calculated in MegaX for 1,000 replicates using the pairwise deletion option and shown on the tree when greater than 50. Evolutionary distances were estimated using the Poisson correction distance86.

Macro- and micro-synteny analysis

To track the genomic dynamics across the genus Oryza, we performed a macro-synteny analysis using GENESPACE87 (v.1.3.1). This software implements and improves features from OrthoFinder88 (v.2.5.4) and MCScanX_h89 to link gene sequence homology with gene coordinates and traces the processes of genome polyploidization, reduction, rearrangement and translocation across a set of genomes. The protein sequences of the nine tetraploid and two diploid genomes reported in this study, as well as the publicly available protein-coding genes of ten diploid Oryza species and the outgroup L. perrieri (Supplementary Table 8), were analyzed in GENESPACE using default settings. The macro-synteny analysis was run with each tetraploid genome separated into two individual subgenomes (that is, the analysis was performed considering each genome type separately). The macro-synteny results were visualized as a riparian plot using the embedded plot_riparian function, and the homology and collinearity information at the subgenome level was used to build a synteny-constrained phylogenomic framework of the Oryza (sub)genomes.

Gene micro-synteny analysis across the Oryza genomes was performed using a modified version of a previously developed pipeline90 in which pairwise synteny was inferred by GENESPACE and syntenet clusters were generated using custom developed scripts (https://github.com/xiaoyezao/Rice-Phylogenomics). Copy-number variation of syntenic homologs (a group of genes derived from a single common ancestor and retained in syntenic relationships) was profiled using the SYNTENET91 (v.1.5.2) package, and visualized using heat maps. The syntenic homologs were then used for multilocus phylogenomic analysis to infer the phylogeny of the (sub)genomes, as described in the next section.

Phylogenetic relationship analysis of the chloroplast and nuclear genome across the Oryza genus

We performed a phylogenomic analysis using chloroplast-based and multinuclear-gene-based approaches to infer the evolutionary history of the Oryza species/genomes. For chloroplast-based phylogeny (https://github.com/nam-hoang/rice_cp_phylo_analysis), we used the chloroplast genome sequences of ten Oryza species (O. malampuzhaensis, O. minuta, O. alta, O. grandiglumis, O. latifolia, O. coarctata, O. schlechteri, O. ridleyi, O. longiglumis and O. meyeriana) assembled from whole-genome PacBio sequencing data (in this study), plus 17 publicly available chloroplast genome sequences from 16 diploid Oryza species and the outgroup L. japonica (Supplementary Table 11). To construct the maximum likelihood phylogenetic tree, we aligned the large-single-copy regions of the chloroplast genomes using MAFFT92 (v.7.480). Poorly aligned regions were trimmed using trimAl93 (v.1.4) with the option ‘-automated1’. The alignments were analyzed with IQ-TREE94 (v.1.6.12) using default settings (1,000 bootstrap iterations) and the best-fit substitution model identified by ModelFinder95. The resulting tree was visualized in FigTree (v.1.4.3) (http://tree.bio.ed.ac.uk/software/figtree/) and rooted using L. japonica as the outgroup. For nuclear-gene-based phylogeny, a nuclear gene dataset was obtained from the GENESPACE87 output as described above. A total of 3,728 single-copy syntenic genes found in the 30 Oryza (sub)genomes and the outgroup genome were used for the phylogenomic analysis. Syntenic homologous genes were aligned using MAFFT92 (v.7.520) (--genafpair --maxiterate 1000), and the alignments were cleaned using trimAl93 (v.1.4.1) (-gt 0.6 -st 0.001). The gene trees were inferred using RAxML-NG96 (v.1.2.0) with the Q.plants model97. From the full set of gene trees, the phylogeny at the (sub)genome level was inferred using the coalescent algorithm in Astral-Pro2 (ref. 98) (v.1.16.1.3). Branch supports of the Astral tree were estimated as local posterior probabilities99.

To date the Oryza phylogeny, the least square dating method42 as implemented in IQ-TREE100 v.2.3.6 was used on a concatenated alignment of 528 single-copy genes. The substitution model was determined by ModelFinder implemented in IQ-TREE v.2.3.6, and the best maximum likelihood tree was obtained by maximum likelihood inference with 20 initial parsimony trees. The resulting phylogeny was highly consistent with the phylogeny estimated using the coalescent approach described above, and it was used for phylogenetic dating with a molecular calibration of 14.5 Ma for the crown age of Oryza and 6 Ma for the divergence of CC and AA-BB39,43. The confidence interval of the estimated dates was obtained by resampling branch lengths 100 times. Machine-readable phylogenetic trees are provided in Supplementary Data 2.

Synteny-based phylogenetic analysis

The synteny-based phylogenetic inference was performed following the Syn-MRL approach101 as implemented in the R package syntenet91 v.1.7.1. Specifically, the pangene table of the 31 (sub)genomes generated from the GENESPACE macro-synteny analysis was phased using a custom script (https://github.com/xiaoyezao/Rice-Phylogenomics) to identify syntenic homologous genes and syntenic clusters. Then, phylogenomic profiling of the syntenic clusters to determine which (sub)genomes contained which clusters was performed using the cluster_network function as implemented in syntenet. The phylogenomic profile was converted into a binary matrix in which the presence or absence of a syntenic gene from a given (sub)genome was coded as 1 or 0, respectively. Finally, phylogenetic analysis of the binary matrix was conducted using IQ-TREE v.2.3.6 with 1,000 Shimodaira–Hasegawa-like approximate likelihood ratio bootstrap replicates.
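The conversion of a phylogenomic profile into a binary matrix can be sketched in a few lines of Python. The input layout (a dictionary mapping each syntenic cluster to the set of (sub)genomes that contain it), the function names and the PHYLIP-style text output are assumptions made for illustration; in the study this step was carried out with syntenet and a custom script, and the exact input format expected by IQ-TREE should be checked against its documentation.

```python
def presence_absence_matrix(clusters, genomes):
    """Code presence (1) or absence (0) of each syntenic cluster per (sub)genome.

    `clusters` maps a cluster ID to the set of (sub)genome labels containing it.
    Columns follow the sorted cluster IDs so that the output is reproducible.
    """
    cluster_ids = sorted(clusters)
    matrix = {
        genome: [1 if genome in clusters[cid] else 0 for cid in cluster_ids]
        for genome in genomes
    }
    return cluster_ids, matrix


def to_phylip(matrix):
    """Render the matrix as PHYLIP-style text: a header, then one row per taxon."""
    n_columns = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_columns}"]
    for genome, row in matrix.items():
        lines.append(f"{genome} {''.join(str(value) for value in row)}")
    return "\n".join(lines)


# Toy profile with hypothetical cluster IDs and (sub)genome labels.
toy_clusters = {
    "cluster_1": {"Ocoa_KK", "Ocoa_LL", "Osat_AA"},
    "cluster_2": {"Ocoa_LL"},
    "cluster_3": {"Ocoa_KK", "Osat_AA"},
}
toy_ids, toy_matrix = presence_absence_matrix(
    toy_clusters, ["Ocoa_KK", "Ocoa_LL", "Osat_AA"]
)
toy_phylip = to_phylip(toy_matrix)
```

Each row of the resulting matrix is then treated as a sequence of binary characters, so that (sub)genomes sharing more syntenic clusters are placed closer together in the tree.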

Ks analysis

The SynMap102 and CodeML103 tools, implemented in the Comparative Genomics (CoGe104) platform v.7 (https://genomevolution.org/coge), were used to identify collinear blocks of homologous genes between subgenomes of the same type (for example, between the KK subgenomes of O. coarctata and O. schlechteri) and to calculate the number of synonymous substitutions per synonymous site (Ks). The SynMap analysis was run using default parameters (comparison algorithm: Last; window size: 100 genes; minimum number of aligned pairs: 5 genes; maximum distance between two matches: 20 genes). The distribution of Ks values ≤0.3 for each pair of subgenomes was plotted using a bin width of 0.01.
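The binning used for the Ks distributions can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; it assumes that Ks values have already been exported as plain numbers and that a value lying exactly on the upper limit is assigned to the last bin.

```python
def ks_histogram(ks_values, max_ks=0.3, bin_width=0.01):
    """Count Ks values in fixed-width bins over [0, max_ks]."""
    n_bins = round(max_ks / bin_width)
    counts = [0] * n_bins
    for ks in ks_values:
        if ks < 0 or ks > max_ks:
            continue  # values outside the plotted range are ignored
        # Clamp so that ks == max_ks lands in the last bin.
        index = min(int(ks / bin_width), n_bins - 1)
        counts[index] += 1
    return counts
```

The position of the main peak in such a histogram gives an approximate measure of the divergence between the two subgenomes being compared.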

Gene fractionation

Gene fractionation was run on each tetraploid genome using SynMap102 and FractBias105, both implemented in CoGe104 (v.7). SynMap was used to define gene homoeology and collinearity between subgenomes in each tetraploid genome as described above. FractBias was used to calculate and plot gene retention by setting a quota align ratio of 1:1.

Expression of homoeologous genes in O. coarctata

To identify pairs of homoeologous genes (that is, orthologous genes between the KK and LL subgenomes) among the predicted genes in the O. coarctata genome, Blastp106 (v.2.14.0) was used with an e-value cut-off of \(1\times 10^{-5}\), and reciprocal hits were filtered using a bit score threshold of 50 and a minimum alignment identity of 50% for reciprocally aligning regions. The same method was used to identify genes in O. sativa homologous to genes in O. coarctata. In O. coarctata, genes were divided into two main categories: nonfractionated genes, homoeologous gene pairs found in both the KK and LL subgenomes; and fractionated genes, those found in one subgenome with no homoeologous pair in the other subgenome. In O. sativa, genes were also divided into two main categories: paired homologous genes, O. sativa genes that had homology to nonfractionated gene pairs in O. coarctata; and single-copy homologous genes, O. sativa genes that had homology to fractionated genes in O. coarctata. As the density distribution of log2-transformed raw read counts was homogeneous in all samples across the three plant growth tanks (Supplementary Fig. 16), we considered the individual plants as technical replicates. Paired-end reads from all the plants were mapped on each individual subgenome sequence of O. coarctata using TopHat2 (ref. 107) (v.2.1.1). For reads that mapped to both subgenome sequences, we used EAGLE-RC108 (v.1.1.1) to determine the likelihood of read alignment against each subgenome without knowing the genotype differences explicitly (-ngi) and chose the best alignment. Expression levels were determined as transcripts per million (TPM) with TPMCalculator109 (v.0.0.3). To compare expression of genes in O. coarctata homologous to genes in O. sativa, we obtained publicly available datasets of Illumina paired-end RNA-seq of flag leaf (SAMN22452874, SRR16526865, SRR16526866), leaf (SRR4017523, SRR4017527) and root (SRR25078452, SRR25078455, SRR25078456) of O. sativa. Sequencing data were mapped to the O. sativa IRGSP reference sequence using TopHat2 (ref. 107) (v.2.1.1). TPM values were calculated as described above (the code is available at https://github.com/gopal-misra/Oryza_coarctata_expression). Differences in expression levels in the six gene categories were visualized as log2(TPM + 1), and P values between the nonfractionated and fractionated and between the paired homologous and single-copy homologous categories were obtained using a two-sided Wilcoxon rank-sum test.

Expression bias (B) was quantified in homoeologous gene pairs over leaf and root tissue using the log2 fold change:

$$B={\log }_{2}\left(\frac{{{\mathrm{TPM}}}_{{\mathrm{LL}}}+1}{{{\mathrm{TPM}}}_{{\mathrm{KK}}}+1}\right),$$

where \({{\mathrm{TPM}}}_{{\mathrm{KK}}}\) and \({{\mathrm{TPM}}}_{{\mathrm{LL}}}\) indicate the expression levels in the KK and LL subgenomes, respectively. Expression bias values greater than 1 (B > 1) indicate that the homoeologous copy on the LL subgenome is dominantly expressed, whereas expression bias values less than −1 (B < −1) indicate that the homoeologous copy on the KK subgenome is dominantly expressed49. Homoeologous gene pairs showing a twofold change or less (−1 ≤ B ≤ 1) were classified as nondominantly expressed. A one-sided binomial test with a P value threshold of 0.001 was used to assess differences in numbers of dominantly expressed genes in the subgenomes for each tissue.
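The classification and test described above can be sketched in Python as follows. The function names are hypothetical, and the null proportion of 0.5 (equal numbers of KK-dominant and LL-dominant pairs) is an assumption of this sketch, as the null hypothesis is not spelled out above.

```python
import math


def expression_bias(tpm_kk, tpm_ll):
    """B = log2((TPM_LL + 1) / (TPM_KK + 1)) for one homoeologous gene pair."""
    return math.log2((tpm_ll + 1) / (tpm_kk + 1))


def classify_pair(tpm_kk, tpm_ll):
    """Return 'LL', 'KK' or 'nondominant' according to the thresholds on B."""
    bias = expression_bias(tpm_kk, tpm_ll)
    if bias > 1:
        return "LL"
    if bias < -1:
        return "KK"
    return "nondominant"


def binomial_upper_tail(successes, trials, p=0.5):
    """One-sided P value: P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )
```

For one tissue, the number of pairs dominated by the more frequent subgenome would be passed as successes and the total number of dominantly expressed pairs as trials; nondominant pairs do not enter the test.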

Statistics and reproducibility

To assess the significance of the correlations between (sub)genome size (Mb) and TE content (Mb), we calculated Pearson’s linear correlation coefficient (R²) with a two-sided t-test. For functional enrichment analysis of the syntenic pangenome, two-tailed Fisher’s exact test with a P value threshold of 0.001 was used on the set of GO-annotated genes in each pangenome category (that is, core, softcore and dispensable) and a reference set containing all GO-annotated genes in the pangenome. P values were corrected using the false discovery rate. To test for differences in percentages of gene retention between subgenomes in each tetraploid species, we used a two-sided Wilcoxon rank-sum test with a P value threshold of 0.001 (the wilcox.test function in R v.4.3.2 with continuity correction).
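To make the rank-sum test concrete, the Python sketch below implements the two-sided Wilcoxon rank-sum test using the normal approximation with continuity correction and a tie-corrected variance. It is an illustrative reimplementation with hypothetical function names, not the code used in the study; R's wilcox.test switches to an exact calculation for small samples without ties, so results can differ in that case.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, two-sided P) from the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return w, 1.0  # all values identical
    difference = w - n1 * n2 / 2
    # Continuity correction shrinks the difference by 0.5 towards zero.
    correction = 0.5 if difference > 0 else -0.5 if difference < 0 else 0.0
    z = (difference - correction) / math.sqrt(variance)
    return w, min(1.0, math.erfc(abs(z) / math.sqrt(2)))
```

Here x and y would hold, for example, the per-chromosome gene retention percentages of the two subgenomes of one tetraploid species.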

To test for differences in the expression levels of genes of six categories (four gene categories in O. coarctata: nonfractionated genes (nf), homoeologous gene pairs found in both the KK and LL genomes; and fractionated genes (f), those found in one subgenome with no homoeologous pair in the other subgenome; and two gene categories in O. sativa: paired homologous genes (ph), O. sativa genes that had homology to nonfractionated gene pairs in O. coarctata; and single-copy homologous genes (sch), O. sativa genes that had homology to fractionated genes in O. coarctata), we used a two-sided Wilcoxon rank-sum test and a P value threshold of 0.001 (the wilcox.test function in R v.4.3.2 with continuity correction). To assess differences in the number of genes dominantly expressed in the subgenomes of O. coarctata, in the leaf and root, we used a one-sided binomial test and P value threshold of 0.001.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.