Introduction

The intricate regulation of gene expression is crucial for executing complex biological processes in higher organisms. Maize, a classical genetic model in gene regulation, has been extensively studied since Dr. Barbara McClintock’s groundbreaking discovery of transposable elements (TEs), commonly known as jumping genes, which cause the mosaic pigmentation seen on maize kernels1. Recent assembly and analysis of maize genomes have unveiled a highly sophisticated and hierarchical manner of gene regulation in maize, characterized by the prevalence of distal cis-regulatory elements (CREs)2, particularly enhancers, that are likely attributed to the proliferated maize genome (~2.3 Gb) with over 85% DNAs composed of TEs. Meanwhile, genetic mapping efforts in maize unveiled an enrichment of quantitative trait loci (QTLs) within non-genic regulatory regions3, including well-known loci such as tb14,5, vgt16, KRN47, ZmCCT98, ZmCCT109, and UPA110. On a higher-dimensional scale, the widespread distal CREs in maize can orchestrate with transcription factors (TFs) and other trans-acting elements while simultaneously interacting with promoters or other gene-proximal regions to regulate gene expression through chromatin interactions2. These findings underscore the crucial role of regulatory sequences in controlling maize development and trait variations.

The progress of high-throughput chromatin profiling assays, such as ATAC-seq, CUT&Tag and DAP-seq technologies, has revolutionized the genome-wide identification of CREs in plants at base-pair resolution11. ATAC-seq utilizes engineered hyper-sensitive Tn5 transposons preloaded with sequencing adaptors to selectively target DNAs within accessible chromatin regions (ACRs)12. ACRs provide biomacromolecules physical access to DNAs and are usually colocalized with CREs and transcription factor binding sites (TFBSs), thus playing a critical role in transcriptional regulation13. In maize, a series of studies have employed ATAC-seq to profile ACRs, which typically encompass tens of megabases (Mbs) across major maize tissues including seedlings14,15, roots16, leaf2,17, tassel18 and ear2,18. Apart from these progresses in bulk tissues, ATAC-seq has also been continually optimized to work with the increasingly smaller sample input, even down to a single cell. A recent study utilized single-cell ATAC-seq to investigate the chromatin dynamics associated with cellular heterogeneity in a few maize tissues19. However, as a model species with substantially high tissue complexity, the existing ATAC-seq data in maize is undoubtedly unsaturated, and an in-depth characterization of the accessible chromatins in maize is still awaited.

In comparative genomics, it is well established that highly constrained DNA sequences shared among related species are essential for organismal survival and may have undergone stronger purifying selection. However, those that specifically exist in a single or a small group of species are usually under positive selection and responsible for innovative traits20,21. In the past few years, increasing evidence indicates that a considerable proportion of genome sequences including not only the coding regions but also massive CREs in the non-coding regions can be evolutionary constrained20,21,22,23. Since CREs are typically associated with epigenetic features such as accessible chromatin, active histone modifications, low DNA methylation, and unique binding of TFs11, an emerging research field of evolutionary epigenomics has combined both epigenomic and comparative genomic approaches to study the constraint of DNAs associated with CREs20,21,24. In mammals, the Zoonomia project estimated that approximately half of the human CREs are evolutionarily constrained in 240 other mammal genomes22. Another work reported that approximately 80% and 50% of the evolutionary constrained DNA bases in primates are located outside protein-coding exons and have no annotation in human ENCODE data21. Relevant scenarios are also found in plants, where more than half of the highly constrained DNAs between diploid potato and other Solanaceae genomes occur in non-coding regions23. Similarly, several published works have utilized evolutionary constraint to identify conserved non-coding sequences (CNSs) in maize and other major crops25,26,27, however they usually lacked enough ATAC-seq or other similar data to further classify potential CREs within these CNSs. With the technical progress of CRE annotation methods and the availability of multiple high-quality genome sequences in diverse Poaceae species, we can now generate a comprehensive map of regulatory loci in maize2,19,28,29,30,31,32,33 and trace their evolutionary trajectory during approximately 100 million years of Poaceae genome evolution34, revealing new insights into maize trait innovations compared to its wild ancestor and other grass species.

Here, we build a multi-tissue map of accessible chromatin in over ten major maize tissues and explore their interspecific and intraspecific constraint in multiple Poaceae and maize genomes. Unlike in primates, our analysis reveals strikingly low constraint (~1%) of ACRs in Poaceae, and highlights the major role of TEs in generating maize specific ACRs primarily occurring in regions distal to genes. By comparison, approximately half of ACRs are constrained in diverse maize genomes, exhibiting higher enrichment in trait-associated SNPs compared to other highly variable ACRs. We also uncover a set of ACRs likely associated with maize domestication, demonstrating their entanglement with protein-coding genes that either are artificially selected or differentially expressed between maize and wild teosintes. Our work establishes a framework for the evolutionary analysis of massive plant regulatory elements, and unveils some potentially important regulatory loci involved in maize speciation and domestication.

Results

A multi-tissue map of accessible chromatins in maize

We profiled the accessible chromatins for 12 major tissues or cell types in maize (Supplementary Data 1). Extending from the earlier data of ear2,18, tassel18, leaf2, seedling14,15, root16 and mesophyll17, we generated the ATAC-seq data for the embryo, endosperm, pericarp, shoot apical meristem (SAM), silk and stem (Fig. 1a). Each data had two biological replicates that exhibited high reproducibility, canonical enrichment near the transcription start sites (TSSs) of genes, and clustered well in both principal component (PCA) and heatmap analysis (Supplementary Fig. 1and Supplementary Fig. 2). Therefore, the data from two biological replicates were combined for subsequent analysis. Around 26,743 to 39,561 ACRs (Fig. 1b) were identified in each tissue, covering approximately 0.67% to 1.0% of the B73 reference genome (Fig. 1b). As expected, ACRs were mainly distributed in euchromatin regions (Fig. 1c), aggregating into some hotspots conserved or that specifically existed in different tissues (Supplementary Fig. 3). We further classified ACRs into three categories according to their intersection with protein-coding genes2: genic ACRs (gACRs, overlapping ≥1 bp with genes, 36% to 52%), proximal ACR (pACRs, overlapping ≥1 bp with 2 kb regions flanking genes, 19% to 25%) and distal ACRs (dACRs, those other than gACRs and pACRs, 29% to 46%) (Fig. 1d). The proliferation of the maize genome has led to a significant number of dACR located in “gene deserts” regions (100 kb away from the nearest gene, Supplementary Fig. 4a). Interestingly, these ultra-distal ACRs showed high conservation across the tissues we profiled and were independently verified by maize single-cell ATAC-seq data19 and ChIP-seq data2 of various active histone modifications (Supplementary Fig. 4a and 4b). They were also found to be entwined with chromatin interactions, with some pinpointed to genes in 3D genomic data2 (Supplementary Fig. 4b), suggesting the cis-regulatory role of maize ACRs in both local and distal manners.

Fig. 1: A multi-tissue map of accessible chromatins in maize.
figure 1

a Overview of the 12 major maize tissues we profiled in this study. Except for the ear and seedling, which had two and three ATAC-seq data, the other tissues had a single ATAC-seq data. We hereafter used sample to represent each of the 15 ATAC-seq data. b The number and length of ACRs identified in each sample (seedling1, seedling2 and seedling3 represent three different seedling samples). c Circos plot showing the chromosomal distribution of ACRs in each sample. Tracks from outer to the inner represented chromosome ideograms, retrotransposons, DNA transposons, gene density and ACR density of 15 samples, respectively. d The distribution of ACRs on different genomic features (genic, proximal and distal). e Bar plot showing the composition of different TE superfamilies in ACRs and random control regions. f The ranking of enriched motifs in ear (2 ~ 4 mm), root, and mesophyll. Different TF families are indicated by colored dots. The P-value of each motif was estimated using one-sided Fisher’s exact test. g Upset plot showing the ACRs that conserved or specifically existed across 15 samples. A plot of saturation analysis by fitting the Michaelis-Menten equation was attached and the error bars represent the average value ± standard deviation (SD). Source data are provided as a Source Data file.

In contrast to the maize genome that primarily composed of TEs (~85%), only ~25% of ACRs overlapped with TEs (Fig. 1e), suggesting the enrichment of ACRs for low-copy sequences in maize. Specifically, dACRs (~40%) had higher TE composition over pACRs (~23%) and gACRs (~15%). With respect to TE superfamilies, Ty3 (Gypsy) accounted for over half of TEs in dACRs, while Helitrons enhanced substantially in pACRs and gACRs which was consistent with their strong preference for gene-rich regions in maize35 (Supplementary Fig. 5a). It was worth noting that the insertion time of LTRs associated with ACRs seemed to be more ancient than the remaining non-association LTRs (Supplementary Fig. 5b). Unlike TEs, simple repeats have been reported to be enriched in CREs in humans36 and also found to be overrepresented in maize ACRs (Supplementary Fig. 5c). Specifically, they exhibited a notable valley in the core region of ACRs while being elevated in the flanking regions, which was consistent with their presumed role in assisting the binding of TFs36.

By integrating the data of PlantTFDB37, JASPAR38 and DAP-seq2,39, the DNA binding motifs of 306 maize TFs from 36 families were curated (Supplementary Data 2). We discovered the motifs of 243 TFs within 50 bp sequences flanking the summit of ACRs, with each tissue having 157 to 191 TFs (Supplementary Data 3). Similar to a previous report in cotton40, the motifs of BRR-BPC, C2H2, Dof, ERF and TCP families, which were fundamentally important for plant growth and development, were highly enriched in all profiled tissues (Fig. 1f and Supplementary Fig. 6). Specifically, the enrichment of SBP families was high in the ACRs of ear, tassel and SAM, consistent with one of their major roles in establishing meristem boundaries within inflorescence tissues41. Additionally, we found motifs from some WRKY family genes that were overrepresented in roots and mesophyll, a phenomenon consistent with their significantly enhanced transcription in similar tissues42,43.

We further merged the ACRs of all tissues into 80,365 multi-tissue ACRs (~61.4 Mb, ~2.9% of the maize genome), accounting for 2.8 to 4.4-fold of the ACR length in a single tissue. Our estimation approached the maize single-cell ATAC-seq assays (~4%), which integrated data from 56,573 nuclei across six major tissues19. However, this ACR map was apparently not saturated, representing approximately 85% saturation according to the Michaelis-Menten equation analysis (Fig. 1g), indicating a need to further incorporate more specialized tissues or cell-types. ACRs were further classified into constitutive, variable and tissue-specific ACRs (Fig. 1g). The contrasting numbers of variable (n = 46,483) and constitutive ACRs (n = 6475) again suggested the high tissue specificity of accessible chromatins in maize. To further reveal the selective pressure on ACRs, we estimated the SNP density within and flanking ACRs using the data obtained from maize HapMap344. As expected, the constitutive ACRs had the lowest SNP density in their central region (Supplementary Fig. 7), followed by variable and tissue-specific ACRs. This observation was in accordance with their conservation degree. Interestingly, the flanking regions of constitutive and variable ACRs showed a significantly higher SNP density than tissue-specific ACRs. In summary, we generated a map of over 80,000 ACRs in multiple maize tissues, which allowed their evolutionary characterization in subsequent sections.

Evolutionary constraint of maize ACRs in other grass genomes

Comparative genomic studies in plants have commonly characterized the conservation or innovation of protein-coding genes, yet the evolutionary constraint of more widespread regulatory sequences remains less investigated. Based on the multi-tissue ACRs, we employed a Cactus-based comparative approach (Supplementary Fig. 8) to trace their evolutionary trajectory in 34 Poaceae species and 4 outgroups (Supplementary Fig. 9 and Supplementary Data 4). To demonstrate the robustness of our method, we identified the regions in sorghum genome that were highly constrained (alignment coverage ≥ 90%) with maize ACRs, which could successfully reconstruct the two maize subgenomes (Supplementary Fig. 8b). When expanding to other Poaceae species, the number of highly constrained ACRs negatively correlated with their pairwise divergence (Fig. 2a). For instance, 7117, 3592 and 1421 ACRs were highly constrained in sorghum (diverged ~11.1 Mya), foxtail millet (~22.8 Mya) and rice (~49.1 Mya) respectively, exhibiting good consistency with their overall phylogeny (Fig. 2a and Supplementary Data 4). In general, the constraint of ACRs was situated between protein-coding sequences and randomly selected DNA sequences (Fig. 2a), suggesting an accelerated evolution of ACRs over protein-coding sequences in crops.

Fig. 2: Interspecific constraint of maize ACRs in grasses.
figure 2

a The proportion of constrained peptides (blue), genes (purple), ACRs (green) and randomly selected sequences (red) with the increasing of pairwise divergence time (Mya) between maize and other non-maize genomes. The error bands indicate 95% confidence intervals. b Distribution of maize ACRs according to their constraint in other grass genomes. N1 denotes the number of grass species in which ≥ 90% of maize ACRs can be aligned. N2 denotes the number of grass species in which ≤ 10% of maize ACRs can be aligned. Three distinct groups (highly constrained, moderately constrained and maize specific) are highlighted in squares. c The distribution of genomic features of six groups including highly constrained, moderately constrained, maize specific, class I maize specific, class II maize specific ACRs and randomly selected ACRs as control. d Barplot showing the composition of different TE superfamilies in different categories of ACRs as shown in (c). e UMAP projection of all 80,365 ACRs based on their constraint level in each of the non-maize genomes. Each point is one ACR and the colors represent ACR groups. f The chromatin accessibility of ra1 (Zm00001d020430) and its ~40 kb flanking region. The red region indicates a maize specific ACR. The structure of genes is shown in the bottom track. g Tissues specificity of genes targeted by highly constrained ACRs and maize-specific ACRs across 12 tissues. The P-values were calculated by two-sided Wilcoxon test. Tau was used as the index of tissue specificity. The index varied from 0 (consistently expressed genes) to 1 (strongest tissue-specificity of gene expression). The box denotes the maxima, 25th, median, 75th percentiles and minima, and the whiskers indicate the 1.5× interquartile range. Source data are provided as a Source Data file.

We then projected ACRs into a planar map (N1-N2, Fig. 2b) according to the numbers of highly constrained species (N1) and the number of species without constraint (N2, alignment coverage ≤ 10%). It was worth noting that only 1% (n = 1,070) of the maize ACRs were highly constrained in most Poaceae species (N1 ≥ 20) (Fig. 2b), in contrast to the high conservation in humans that approximately half of CREs were constrained in primates. In addition, a substantial number of ACRs (~36.2%) specifically existed in maize (N2 ≥ 34) (Fig. 2b). These data implied an unprecedented high genetic diversity of regulatory sequences in grasses, which further prompted us to explore the driving forces behind maize specific ACRs. We found that a profound proportion (~65%) of maize specific ACRs were located in distal regions (Fig. 2c), while both highly and moderately constrained ACRs were exclusively located (> 90%) in genic regions. Meanwhile, nearly half of the sequences in maize specific ACRs were annotated as TEs (Fig. 2d), particularly Ty3 elements, contrasting with the rare TE composition in highly and moderately constrained ACRs. These data highlighted the major role of TEs, mostly residing in regions distal to genes, which were involved in the innovation of maize specific ACRs.

To further characterize the interspecific constraint of ACRs, we applied the Uniform Manifold Approximation and Projection (UMAP) algorithm to all ACRs based on their alignment coverage in each of the non-maize genomes (Fig. 2e). The constrained (N1 ≥ 10) ACRs were largely clustered on the left panel, while the maize specific ACRs formed two distinct groups, one (class I, n = 9384) scattered on the right panel, and the other (class II, n = 19,722) forming an isolated island. We found almost no sequence homology of class II ACRs in other Poaceae genomes, compared with sparse homology of class I ACRs. Although class II ACRs exhibited higher affinity with TEs than class I ACRs, more than half of both class II and class I ACRs did not originate from TEs (Fig. 2d). These non-TE maize specific ACRs might result from sequence mutations that had accumulated due to genome redundancy after the whole genome duplication of maize.

Genes regulated by highly constrained or species-specific ACRs are likely to encode for traits of conservation or innovation, respectively22. Using our previously generated Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PETs) data45, we identified these two different sets of genes (n = 2583 and 11,735), many of which could not be identified by the simple principle of proximity (Supplementary Fig. 10). Genes targeted by highly constrained ACRs were found to be enriched in photosynthesis pathways including NADPH: quinone reductase activity and response to light stimulus (Supplementary Fig. 11a). They also participated in the transmembrane transport of some polyols (Supplementary Fig. 11a), including mannitol, which served as a major product of photosynthesis46, as well as xylose which could play a crucial role during the formation of plant cell walls47,48 to mediate plant response to abiotic stresses (Supplementary Fig. 11a). On the other hand, genes (n = 11,735) regulated by maize specific ACRs were primarily enriched in plant immune or defense response pathways as was commonly observed for PAV genes49,50 (Supplementary Fig. 11b). They include jasmonic acid and salicylic acid metabolic processes (Supplementary Fig. 11b), which potentially regulate plant resistance against pathogen invasion51 during the continuous evolution of plant immune system52. Meanwhile, genes regulated by maize specific ACRs were also enriched in flower development and morphogenesis pathways (Supplementary Fig. 11b). One special example was ramosa1 (ra1), a pivotal gene that regulates maize inflorescence branching architecture53 and specifically expressed in immature tassel and ear in our profiled tissues (Supplementary Fig. 12). Notably, only one ACR was identified in both the gene body and 20-kb flanking regions of ra1 (Fig. 2f), which was likely a decisive regulatory locus of ra1. This ACR was maize-specific since it had no constraint in sorghum and other Poaceae species. Therefore, it might function through ra1 to contribute to the substantial changes in the inflorescence architecture of maize compared to other grasses.

We further investigated the expression profiles of the aforementioned two sets of genes to find more clues about their roles in maize speciation. Although genes regulated by highly constrained ACRs exhibited an overall higher expression level (Supplementary Fig. 13a), those regulated by maize specific ACRs, by contrast, exhibited significantly higher tissue specificity (Fig. 2g). This result was consistent with previous findings that tissue-specific expressed genes usually evolved faster54 and were more likely to be affected by speciation and gene duplication55. We further utilized an independent dataset that applied the phylogenetic approach to estimate the age of genes in Poaceae56, and found a higher proportion of young genes among the genes regulated by maize specific ACRs (Supplementary Fig. 13b). Collectively, we demonstrated the massive number of maize specific ACRs that primarily derived from the activity of TEs, and that genes they regulated might be involved in the innovation of some maize specific traits, such as inflorescence development and immune response.

Intraspecific constraint of ACRs and their association with maize domestication

The recent release of diverse maize reference genomes14,57,58 offers us an opportunity to further explore the intraspecific constraint of maize ACRs. By adopting the same comparative approach, we identified highly constrained ACRs in a total of 53 non-B73 maize genomes (Fig. 3a and Supplementary Data 5). We found that approximately half of the ACRs were fixed in nearly all maize genomes (Supplementary Fig. 14a). This ratio was comparable to the estimation of the interspecific constraint of CREs (~50%) in primates21,22, and accorded with a previous estimation of genetic diversity in maize, which was comparable to or even higher than the divergence between human and chimpanzee59. On the other hand, only ~7% (n = 5466) of ACRs were constrained in fewer than 10 maize genomes (Fig. 3a), among which 337 were specifically found in B73. It seemed that the conservation of ACRs we estimated here was stronger than genes, since a pan-genome analysis of the maize NAM population had reported a higher degree of gene variability14 (~70% were dispensable or private genes). To clarify, we simultaneously estimated the constraint of ACRs and genes in our panel. We found apparently stronger constraint of genes over ACRs, and their constraint were highly correlated (r = 0.98) with only a few exceptions (e.g., LH244 and PH207) that the constraint of ACRs was slightly stronger than genes (Supplementary Fig. 14b). Interestingly, ACRs that constrained intraspecifically exhibited quite different genomic properties compared to interspecifically constrained ACRs. For instance, TEs were common in intraspecifically constrained ACRs (Fig. 3c), while rare in interspecifically constrained ones. In addition, a considerable proportion of intraspecifically constrained ACRs were found in distal regions (Fig. 3b), while almost no distal ACRs was found in interspecifically constrained ones. These findings underscored the importance of historical TEs shared by the common maize ancestors, which might reside in regulatory regions and subsequently be fixed after maize diversification.

Fig. 3: The characteristics of intraspecifically constrained ACRs and domestication-associated ACRs.
figure 3

a Distribution of ACRs in B73 according to their constraint in other non-B73 maize genomes. b The distribution of genomic features of five ACR groups including highly constrained, moderately constrained, weakly constrained, domestication-associated ACRs and randomly selected ACRs as control. c Barplot exhibiting the composition of different TE superfamilies in different ACR categories as shown in (b). d The ranking of enriched motifs in domestication-associated ACRs. Different TF families are depicted by colored dots. The P-value of each motif was estimated using one-sided Fisher’s exact test. e The proportions of DEGs among genes targeted by domestication-associated ACRs and randomly selected control ACRs (n = 856). The P-values were calculated by two-sided Wilcoxon test. The error bars represent the average value ± SD. f The chromatin accessibility of bHLH68 (Zm00001d025752) and its ~90 kb flanking region. The red region indicates domestication-associated ACR; the blue region indicates the target gene; the red and blue lines represent chromatin interactions between domestication-associated ACR and bHLH68 identified by ChIA-PETs and HiChIP, respectively. The structure of genes is shown in the bottom track. Source data are provided as a Source Data file.

Genome comparison between wild teosinte and maize had uncovered hundreds to thousands of candidate genes associated with maize domestication5,7. Still, little is known regarding the regulatory sequences involved in this process. Based on the classical case of tb1, which was a gain-of-function Hopscotch insertion that was fixed in maize while being absent in teosintes5, we hypothesized that ACRs that were highly constrained in most maize genomes but were absent in the majority of wild teosintes60 (≥ 3 of 5 analyzed teosinte genomes (3 Zea mays ssp. parviglumis61,62 and 2 Z. mays ssp. mexicana62)), should play a pivotal role in rewiring transcriptional networks during maize domestication. Under this scenario, we could successfully validate the ACR that colocalized with tb163 (Supplementary Fig. 15a). On a genome-wide scale, we identified 856 ACRs potentially involved in maize domestication. Interestingly, compared with all highly constrained ACRs in maize, this small group of highly-constrained and domestication-associated ACRs showed uncanonically higher TE composition (Fig. 3c), as well as a profound ratio within distal non-genic regions. These results once again highlighted the prevalence and importance of TEs, particularly those located distal to genes during maize domestication. We were interested in the TFs involved in regulating these domestication associated ACRs and further predicted their TFBSs. We found significant motif enrichment of TCP family (Fig. 3d), which was accorded with the identity of master regulator gene tb1 of maize domestication. In addition, we also discovered the enriched motifs of C2H2 family including ramosa1 gene, as well as ERF and bZIP families that were fundamentally important during plant development.

To offer more functional clues into these domestication-associated ACRs, we again utilized the ChIA-PETs data to find a total of 463 genes that entangled with these ACRs through chromatin interaction. Notably, we noticed significant enrichment of these genes in another independent domestication gene list (Supplementary Fig. 15b), which was identified by scanning SNPs between teosinte and maize populations64. Moreover, among the differentially expressed genes (DEGs) identified in three major tissues between teosintes and maize65, we found a significantly higher ratio (17.5%, Supplementary Data 6) of DEGs within these interacting genes over randomly selected genes without interactions (12.4%, Fig. 3e). As an example, we identified an ACR located ~51 kb upstream of bHLH68, also known as phytochrome-interacting factor1 or ZmPIF1, which could play versatile roles in regulating the growth, development and drought tolerance of maize66. Chromatin interactions between this ACR and bHLH68 can be found not only in our ChIA-PETs data but also in other 3D genomic datasets (Fig. 3f), and this ACR was fixed in multiple maize lines while absent in teosintes, which correlated well with its up-regulation in maize over teosintes (Supplementary Fig. 12). A similar case was also found for a domestication-associated ACR and its interacting gene cmu2 (Supplementary Fig. 15c), which encoded a chorismate mutase with elevated expression in maize ear and leaf and might be involved in plant pathogen infection resistance67 (Supplementary Fig. 12). In summary, our results demonstrated the great promise of analyzing evolutionary constraint of ACRs to identify candidate regulatory loci associated with crop domestication and subsequent adaptation processes.

Constraint and dominance of ACRs between two maize subgenomes

It is well documented that genes in one maize subgenome always dominated expression over genes in the other68, a phenomenon known as subgenome dominance. However, it has been less investigated for chromatin accessibility. Using 3719 high-quality genes pairs with exact 2:1 synteny to sorghum genes (Supplementary Fig. 16), we reconstructed two maize subgenomes (m1, 1,181 Mb; m2, 726 Mb) and classified ACRs into m1 (n = 47,666) and m2 (n = 28,442), respectively (Supplementary Data 7). Notably, the numbers of highly constrained ACRs in m1 (n = 4647, 9.7%) and m2 (n = 2326, 8.1%) when compared to sorghum were higher than the highly constrained ACRs when mutually compared between m1 and m2 (~1200, Fig. 4a, b), a phenomenon accorded with the well-known reciprocal subgenome fractionation observed for genes68. We then characterized key metrics of ACRs between the two subgenomes, and observed that both the number and density of ACRs in m1 were higher than in m2, although their overall differences were subtle (Fig. 4c and Supplementary Fig. 17a). Accordingly, the transcriptional differentiation was also weak when profiled across the two subgenomes (Supplementary Fig. 18a). However, when we focused on these 3,719 pairs of syntenic genes, the dominance of both chromatin accessibility (Fig. 4d and Supplementary Fig. 17b-c) and gene transcription was significantly enhanced (Supplementary Fig. 18b). Meanwhile, the chromatin accessibility of these conserved regions was also much stronger than the remaining less-conserved regions (Fig. 4c, d). These results suggested the importance of chromatin accessibility in differentiating duplicated gene loci to introduce subsequent differentiation of gene expression69.

Fig. 4: Characteristics of ACRs between two maize subgenomes.
figure 4

a The number of constrained ACR on m1 and m2 when compared to sorghum and when mutually compared. b A schematic of reciprocal subgenome sequence loss happened on ACRs. Straight lines represent genome sequences; gray dashed boxes represent regions of collinearity between genomes; and triangles represent ACRs on genomes. c ACR length density normalized by the lengths of m1 (red) and m2 (blue) across 15 samples. d ACR length density of highly conserved region normalized by the lengths of highly conserved regions of m1 (red) and m2 (blue) across 15 samples. e The proportion of enriched TFBSs that shared between two subgenomes and specifically existed in m1 or m2 subgenome (n = 15). Except for the TFBSs that annotated across the two subgenomes, we also independently annotated TFBSs in promoter and distal ACRs, respectively. The box denotes the maxima, 25th, median, 75th percentiles and minima, and the whiskers indicate the 1.5× interquartile range. The P-values were calculated by two-sided Wilcoxon test. f An overview of ACRs across a ~ 400 kb region between tb1 and its m2 syntenic gene tcp6. Red region indicated the reported functional Hopscotch region; the blue box indicates the genes; the grey bar indicates the transposable elements. Source data are provided as a Source Data file.

To further explore accessible chromatin in specific duplicated loci, we focused on teosinte branched1 (tb1, m1) and its m2 paralog tcp6, both of which encoded the TCP family transcription factors. Although gene density and chromatin accessibility were fairly high in up- and downstream regions of tcp6, no ACRs were identified in its gene body and 10-kb flanking regions (Fig. 4f), which was consistent with its silenced expression in major maize tissues (Supplementary Fig. 19a). In contrast, tb1 was exclusively activated in ear, which might be attributed to the functional Hopscotch insertion that showed colocalization with two independent ACRs in our data (Fig. 4f). We also investigated other two syntenic gene pairs, ub2 (m1) and ub3 (m2), gt1 (m1) and Zm00001d048172 (m2), both were crucial in regulating plant architecture70,71. Cases of ub3 and gt1 were similar to tb1 that their causative sequence insertions, named KRN47 and prol1.171, exhibited accessible chromatin in most of the profiled tissues (Supplementary Fig. 19b and 19c). However, no parallel functional insertion was found near ub2 and Zm00001d048172, and consequently their transcription was weakened or completely silenced (Supplementary Fig. 19a). These data suggested that the insertion of subgenome-specific functional sequences, mostly TEs, might be one of the driving forces underlying the subgenome differentiation of chromatin accessibility and gene expression.

We next sought to decipher the divergence of TFBSs between the two maize subgenomes. Unlike genes and ACRs that exhibited strong fractionation68 (Fig. 4a), approximately 80% of the TFBSs (69.1% to 86.3% across the profiled tissues) could be identified in both subgenomes (Fig. 4e and Supplementary Fig. 20a). However, the divergence of TFBSs enhanced in both promoter and distal ACRs (Fig. 4e and Supplementary Fig. 20b and 20c), e.g., only one-third of the TFBSs in distal ACRs of seedling were shared between two subgenomes. Therefore, the strong conservation of TFBSs between two subgenomes might be attributed to the complementation of ACRs from different functional regions. Additionally, we found a higher number of subgenome-specific TFBS in m1 than m2, which was accorded with the general principal of m1 dominance over m2 (Fig. 4e and Supplementary Fig. 20a-c). For instance, the TFBSs of the GLK family, which regulates the earliest photosynthesis steps by activating genes encoding chloroplast-localized and photosynthesis-related proteins72, were exclusively found in m1 (Supplementary Fig. 21). Interestingly, the genes that encoded these GLK TFs also existed in m1, indicating their regulatory functions that preferentially exerted within the same subgenome. These results indicated that the regulomes of the two maize subgenome were largely shared except for a few TF families that specifically functioned within a single subgenome to precisely regulate the development of particular tissues.

Trait-associated SNPs are enriched in ACRs that highly constrained in maize

In species with complex genomes, such as maize29 and humans73, trait-associated SNPs (TASs) identified by genome-wide association studies (GWASs) often fall into non-coding regulatory regions73. We anticipated a better annotation of maize TASs using our ACR map. Therefore, we first performed a GWAS of 15 complex traits in a maize-associated mapping panel (n = 260) that was genotyped across HapMap3 SNPs, resulting in 110 to 363 TASs (minor allele frequency > 0.05, P-value < 10_5) for each trait (Supplementary Fig. 22 and Supplementary Table 1). We then integrated all TASs (n = 2984) to improve their statistical power when assessed with ACRs. As expected, TASs were overrepresented in ACRs with their density being 1.64 (cob-weight) to 6.49-fold (leaf-length) higher over the remaining non-ACR regions (Fig. 5a). In addition, the overall density of TASs in m1 ACRs was higher than in m2 ACRs (Fig. 5b), indicating a predominant role of the m1 subgenome in the genetic regulation of maize complex traits. Interestingly, although all the traits we investigated were believed to be classical quantitative traits, six out of them (for example, upper-leaf-angle and days-to-tassel) exhibited non-canonical results with higher TAS densities in the ACRs of the m2 subgenome (Fig. 5b). The existence of hotspots with an excessive number of TASs (Supplementary Fig. 23a and b), which usually entangled with a strong LD (linkage disequilibrium) and can be mapped within the m1 or m2 subgenomes for different traits, might provide one of the appropriate explanations for these observations.

Fig. 5: The enrichment of TASs in ACRs.
figure 5

a The ratio of TASs_in_ACRs/TAS_in_non-ACRs and ACR length/non-ACR length across the whole genome, m1 and m2 subgenomes for 15 complex traits. b The TASs density in ACRs of in m1 and m2 subgenome across 15 traits and those that merged into a non-redundant TASs. c The TASs density in 4 different degrees of intraspecifically constrained ACRs. d An overview of ATAC-seq signals and HiChIP interactions (red indicates H3K4me3 and grey indicates H3K27m3) of a non-coding GWAS locus. The position of lead SNP is indicated by light red shading. Regional Manhattan plots of this significant SNP associations for Days_to_tassel and Days_to_silk traits are shown in the right panel. Source data are provided as a Source Data file.

Research in humans has revealed that TASs of disease-related or other complex traits are more likely to be enriched in interspecific constrained CREs21,22. However, due to low and limited number of interspecifically constrained ACRs in Poaceae (Fig. 2b), the power to investigate maize TASs in ACRs with a varying degree of interspecific constraint was statistically weak. However, we observed that the TASs were overrepresented in ACRs with higher intraspecific constraint (Fig. 5c). Specifically, the density of TASs in ACRs that highly constrained in most maize genomes exhibited 1.4-fold increase compared with ACRs with weak constraint in maize. We further verified this result (Supplementary Fig. 24) by using another independent TASs dataset (n = 2360) that also mapped on maize B73 V4 genome and tested in an ultra-large maize population with more than 1000 lines74. These findings suggested that the complex traits in maize are primarily driven by the genetic variants in ACRs, especially for those fixed in diverse maize populations.

We finally attempted to annotate some GWAS loci that resided in ACRs. As an illustration, we identified a lead SNP (chr3: 215,791,183) that significantly associated with both days-to-silk and days-to-tassel, which fell into a non-coding region that was ~5.3 kb away from the nearest gene. Notably, this lead SNP colocalized with an intraspecifically highly constrained ACR accessible in three reproductive tissues (ear, tassel, and SAM) (Fig. 5d), suggesting its potential role in regulating the genes involved in flowering. Further incorporation of 3D chromatin interaction data, such as the HiChIP data (Fig. 5d), could further pinpoint to a list of candidate genes likely interacting with this ACR, although the exact function of these GWAS loci and interacting genes need further experimental validations. To end up our work, we integrated the multi-tissue ATAC-seq data with our recently established HEMU online database (https://shijunpenglab.com/HEMUdb/databrowse)75 to facilitate easy visualization and analysis of any loci that recorded by B73 V4 genome coordinates.

Discussion

With the rapid advancement in plant genomics, we are on a highway to enter the post-genomic era, with primary research goals progressively shifting from gene-centric studies to encompass a wider spectrum of genomic regulatory sequences. In this work, we generated a comprehensive map of accessible chromatin by integrating the ATAC-seq data of 12 major maize tissues. Although ATAC-seq has been commonly used by many studies on maize to explore the tissue-specificity of accessible chromatins2,14,16,17,18,19, most of them are carried out on a single or relatively few tissues. They include the recently published single-cell atlas of ACRs across six maize tissues, highlighting the strong association between GWAS SNPs and tissue/cell-type specificity of ACRs. To our knowledge, the map in our study might be one of the most comprehensive ACR maps in plants to date76, although it is still far from saturation as we await the incorporation of more ATAC-seq data from distinct tissues, cell types and other non-B73 maize lines14,77. We anticipate the growth of ATAC-seq data in maize and other plant species in the forthcoming years32,33, especially through the application of the single-cell ATAC-seq technology that has shown its unique advantages in decoding cell heterogeneity and the flexibility in obtaining chromatin accessibility of bulk tissues when data from tens of thousands of cells are integrated. Meanwhile, the recently developed long-read Fiber-seq is another promising approach78 for profiling chromatin accessibility in maize and other crops with considerably high ratios of TEs. We also anticipate the continuous development of cutting-edge artificial intelligence algorithms to predict functional CREs in complex crop genomes with the integration of multi-omics data79,80,81, and leverage these epigenetic information to dissect complex traits and improve breeding design.

We performed large-scale comparative genomic analysis to estimate the inter- and intraspecific evolutionary constraint of maize ACRs. Although it is well-documented that highly constrained DNA sequences in a group of related species are usually functional, there are also many circumstances where loss-of-function mutations can occur in these highly constrained regions, for example, large-effect SNPs that can result in pseudogenes. Therefore, ATAC-seq data of other crop species are required to unveil the true extent of chromatin accessibility for any DNA sequence constrained within an ACR in maize. In addition, by analyzing the intraspecific constraint, we also identified several hundreds of candidate ACRs that are fixed in most maize lines while absent in teosintes, which were likely associated with maize domestication. However, compared with these gain-of-function regulatory elements, the effect of the genetic bottleneck happened during crop domestication are more usually to result in loss of beneficial alleles in their ancestors. Therefore, generating similar regulatory maps in wild teosintes82 will help shed new light on these untapped regulatory loci that should also play an important role in the maize domestication process. For high-resolution TF binding motifs or CREs, which typically span tens of base pairs within ACRs, it requires more meticulously designed methods to investigate their evolutionary constraint in order to minimize inherent alignment bias. We also demonstrated the power of utilizing ATAC-seq data to annotate the genetic loci discovered by GWAS in maize, although experimental tools, e.g., transient transcription activation assay or genome editing is required for downstream verification. In summary, our study expands our knowledge of plant evolutionary epigenomics, and offers a wealthy data resource for the excavation and utilization to enhance the genetic improvement of maize.

Methods

Plant materials

The seeds of maize B73 inbred line were planted at the experimental station of Shandong Agricultural University, Tai’an, China in 2021. After 12 days after self-pollination, the embryo, endosperm and pericarp from roughly 50 seeds were separated, then freezen quickly in liquid nitrogen. SAM tissues were dissected from 62-week-old field-grown plants and immediately frozen in liquid nitrogen.

Library construction and ATAC sequencing

Plant tissues were ground into powder in liquid nitrogen, then incubated with lysis buffer at 4 °C for 10 min. After filtration with 40 µm cell strainer, cell solution was transferred to the top 60% percoll solution and centrifuged at 1800 xg for 20 min. The intermediate phase of the centrifuged liquid was extracted and then diluted with 10 mL RSB buffer. Cell nuclei pellets were harvested by centrifuge at 1,000 xg for 10 min and washed with RSB buffer once time. Approximate 50,000 nucleus were added to transposition reaction solution to perform tagmentation. Tn5 transposed DNA were purified by AMPure DNA magnetic beads, then amplified by PCR. The final qualified ATAC-seq libraries were sequenced on the Illumina Novo-seq platform (San Diego, CA, United States) with PE150 mode.

ATAC-seq analysis

ATAC-seq raw data was processed using ATAC-pipe pipeline83. The quality of mapping reads was evaluated using the “--MappingQC” module in ATAC-pipe. After trimming adapter sequences using the script in ATAC-pipe, reads were mapped to Z. mays AGPv4 reference genome using Bowtie2 (v.2.3.0)84. The PCR duplicates were removed using the MarkDuplicates function from Picard tools (v.2.9.2; http://broadinstitute.github.io/picard/). The mapped reads were then shifted +4/-5bp depending on the strand of the reads, so that the first base of each mapped read can represent the Tn5 cleavage position. All mapped reads were further extended to 50 bp centered by the cleavage position. Reads mapping to highly repeated regions, chromosome M and P, and unanchored contigs were removed. Two biological replicates from the same sample were merged using the “--Merge” module in the ATAC-pipe.

Sample clustering and PCA

The clustering among all samples was performed by the “dba.plotHeatmap” function of DiffBind (v.3.4)85 with default parameters. The PCA of all samples was performed using the “dba.plotPCA” function of DiffBind (v.3.4) with default parameters.

Transcriptome data analysis

Firstly, FASTP (v.0.20.1)86 was used to trim adaptors and filter out low quality reads. Trimmed reads were further mapped to the Z. mays AGPv4 reference genome by HISAT2 (v.2.2.1)87 with default parameters. Alignments were subsequently sorted and converted into bam files using Samtools (v.1.12)88. Gene expression was estimated by StringTie (v.2.1.5)89 in the forms of both fragments per kilobase of transcript per million mapped reads (FPKM) and transcripts per kilobase million reads (TPM). Genes with FPKM or TPM > 1 were considered to be expressed in the corresponding samples.

ACR identification

MACS290 was used to call peaks by the “--PeakCalling” module in ATAC-pipe with the parameters of “macs2 callpeak -f BED -g 2.1e + 9 -q 0.01 --nomodel --shift 0”. Tn5 integration site density (insertions per kb) was estimated for each peak, as well as the matched and randomized control regions (established using bedtools (v.2.26.0)91, shuffled specifically excluding ACRs from the randomized selection space). ACRs with Tn5 integration site density less than the 99.9% upper quantile of randomized control regions were removed according to the published method14.

Annotation of TEs and their association with ACRs

Repetitive elements were identified using the Extensive de novo TE annotation (EDTA, v.1.9.6)92 pipeline with the parameters “--species Maize --anno 1”. Next, the unclassified transposons from EDTA was re-annotated using DeepTEs93. Then, TE composition in ACRs was calculated by intersectBed (v.2.29.2)91 with the parameters “intersectBed -wo”.

The insertion time of intact LTRs associated with ACRs (50% TE sequence overlap with ACR) was calculated. Firstly, intact LTRs were extracted by two long terminal repeat sequences and mutually aligned by Muscle94. The distance K with Kimura Two-Parameter approach between LTRs was calculated by distmat program in EMBOSS (v6.6.0)95. The activity of each LTR (T) was calculated by the following formular:

$$T=K/(2\times r)$$
(1)

where “r” refers to a general substitution rate of 1.3 × 10−8 per site per year in the grass family96. RepeatMasker was used to annotate simple repeats in the ACR with the parameters “-no_is -norna”.

TF annotation and motif analysis

TF genes (n = 2,361) in the B73 V4 genome were annotated using Transcription Factor Prediction function of PlantTFDB37 (http://planttfdb.gao-lab.org/). A total of 259 and 45 TF motif models were downloaded from PlantTFDB and JASPAR, respectively. Additional 12 TF motif models were detected via de novo motif identification using the discriminative motif discovery workflow of MEME-ChIP (v5.0.2) with default settings based on DAP-seq data2,39. The motifs of 306 TF genes from 36 families were curated. Finally, the enriched motifs in the interest region of ACRs were identified using the AME v.5.0.2 tool of MEME Suite with the parameters; “--scoring avg --method fisher --hit-lo-fraction 0.25 --evalue-report-threshold 10.0 --control”.

Evolutionary constraint analysis of ACRs

To measure the sequence constraint of ACRs, Progressive Cactus (v2.6.0)97 was used to perform pairwise alignment of maize with other genomes, with the parameters “cactus./js_grass./B73.othergenome.txt./ B73.othergenome.hal --consMemory 320 G --binariesMode local --stats”. After obtaining the hal file, the hal-to-chains function of Cactus was utilized to convert into chain file with the parameters “cactus-hal2chains./js_grass./ B73.othergenome..hal chain/ --refGenome othergenome”. Then, we followed the published code to employed the LiftOver98 software to determine the conservation of ACR sequences from B73 genome in other genomes22. Alignment ≥ 90% was considered as constrained ACRs (see Supplementary Fig. 7a). Phylogenetic relationships among the 34 Poaceae species and 4 outgroups were obtained from TimeTree99.

UMAP analysis on ACRs

For each ACRs, the LiftOver software was used to get the percentages of aligned positions in the other 34 genomes. The resulting matrix of ACRs by 34 genomes was then used as input to run UMAP100 with default parameters.

GO analysis

GO enrichment analysis was performed using GOATOOLS101 with Z. mays AGPv4 GO annotation and go-basic.obo (http://current.geneontology.org/ontology/go-basic.obo). The module “find_enrichment.py” with Fisher’s exact test was used.

Analysis of differentially expressed genes

The published RNA-seq data between 9 teosintes and 6 maize inbred lines (generated by John Doebley’s lab65, NCBI SRA accession number of SRP047528) from ear, leaf and stem tissues were downloaded. For the identification of differentially expressed genes between teosintes and maize inbred lines, we treated the bam files from teosintes (n = 18, 2 replicates from each of 9 lines) and maize inbred lines (n = 12, 2 replicates from each of 6 lines) as replicates, then used Cuffdiff102 to calculate the FPKM in teosinte and maize. Genes with the fold-change of FPKM > 2 and q-value < 0.05 were identified as DEGs between teosintes and maize.

Maize two subgenome identification

The two subgenome of Z. mays AGPv4 were identified according to our published pipeline103. Briefly, we used the Synmap pipeline in CoGe (https://genomevolution.org/CoGe/SynMap.pl). We used last to blast104 the CDS sequences between maize and sorghum, then detected syntenic blocks with DAGchainer105 with options -D 20 -A 5. QuotaAlign106 was further used to merge adjacent syntenic blocks. The syntenic depth was set to 2:1 for maize and sorghum comparison, and the overlapped distance was set to 40 to permit overlapped syntenic regions. Fractionation bias was applied to determine subgenome organization in maize compared with sorghum. For each pair of chromosomes, the copy with a greater number of unique genes was assigned to the maize1 subgenome, whereas the pair with fewer uniquely retained genes was assigned to the maize2 subgenome. The CodeML utility in the PAML107 software package was used to calculate the non-synonymous (Ka) and synonymous (Ks) rates between orthologous genes.

High-quality syntenic gene pairs between the two maize subgenomes

According to the above COGE output file of QuotaAlign results, the gene pairs in maize having 2:1 relationship with sorghum genes were extracted, and among those with Ks > 2 were further removed (potential pre-grass homeologs). Finally, remaining gene pairs were assigned to the m1 and m2 subgenomes based on the result of subgenome classification.

Genome wide association studies (GWAS)

Plink108 was used to convert vcf files to bed files. Then the sites with missrate > 0.5 or MAF < 0.05 were filter out. GCTA109 was used to perform PCA. The kinship matrix was obtained using emmax-kin for calculating genetic relatedness. EMMAX110 was used to conduct GWAS analysis and correcte for population structure by including the top 20 principal components (to account for population structure) and the kinship matrix (to correct for genetic relatedness and other fixed effects).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.