Background & Summary

Maize (Zea mays ssp. mays) is known for its large genetic diversity, which allowed the species to adapt to a multitude of environments, including tropical and temperate climates. Maize is now grown throughout the world and is the cereal with the highest production worldwide1. Its extensive genetic and phenotypic variation has also been the foundation of modern hybrid breeding. In the U.S., complementary heterotic groups within the dent germplasm - Stiff Stalk Synthetic and non-Stiff Stalk Synthetic, including Lancaster and Iodent lines - have been developed to generate highly productive hybrids, while in Europe, heterotic effects between dent and flint lines have been exploited to develop productive hybrids adapted to cooler climates. In addition to its role as a major food crop, maize is also a model organism in biology, particularly for genome dynamics, due to its large amount of intra-specific structural variation2 and its massive transposable element content3,4. The discovery that non-coding polymorphisms contribute significantly to a wide range of phenotypic traits5 also led to the establishment of maize as a model for the study of gene expression regulation6,7,8, including the integration of cis-regulatory elements into gene regulatory networks9. Characterizing the genomic diversity of maize is essential for understanding the contribution of structural variants to this diversity, and is a prerequisite for identifying the functional variation underlying phenotypic variation. Near-complete, high-quality chromosome-scale genome assemblies are critical resources to address these questions.

Despite this wide genetic diversity, for decades, most knowledge about the genomic structure and function of maize has been obtained from a single genotype, B73, an American temperate dent line. This knowledge therefore represents only a subset of the genetic variability and biology of the species, with a bias towards the genetics of the Stiff Stalk Synthetic germplasm. In recent years, efforts have been made to de novo assemble full genome sequences of several other maize lines10,11,12,13,14, including flint material of interest for Europe15,16. While providing first insights into maize structural variation, these studies nevertheless remained limited in characterizing the maize pangenome, as they were generated by different laboratories, using different assembly and annotation strategies. This issue has been overcome by the production of a pangenome analysis of a set of 26 founder inbred lines representing a large fraction of maize diversity, including lines from temperate, subtropical and tropical origin, as well as lines from sweet corn and popcorn germplasm17. The production of high-quality assemblies with high contiguity over repetitive regions revealed large amounts of structural variants. Although most of the variants discovered were in high linkage disequilibrium with SNPs, over 6% of the genomic regions found associated with phenotype were solely detected with structural variants and not with SNPs, indicating their biological relevance and their agronomic value. The cumulative number of pan-genes found from this set of 26 lines did not reach a plateau, highlighting the need to explore genome sequences of the maize germplasm more extensively to discover the entire set of maize genes. In particular, the absence of flint material in this dataset hampered a global analysis of the maize germplasm and likely caused an under-appreciation of maize genetic variation. This also limits the use of this pangenome for breeding programs using flint material.

In this study, we expand the current collection of maize whole-genome assemblies by generating high-quality PacBio HiFi-based assemblies for 29 key inbred lines of major relevance to European breeding programs. These include Northern and European flint lines used for adaptation to Northern European climates, inbred lines derived from European landraces of tropical origin, and American dent lines that complete the diversity of the 26 American founder lines (see Table 1).

Table 1 List of inbred lines with genotype information.

Methods

Sample collection and genomic DNA extraction

Plants were grown under standard conditions (growth chamber) up to emergence, then moved to darkness for 2 to 5 days. Young etiolated leaf samples were flash frozen in liquid nitrogen upon collection. Leaf DNA extractions were carried out using three different protocols: EZNA SQ plant kit (Omega, D3095), Mayjonade et al.18 and Nucleobond HMW DNA Kit (Macherey-Nagel, Ref: 740160.20). The protocol used was tracked for each sample and can be found in the DNA samples metadata. DNA was quantified using the Qubit fluorimetry system, with the High Sensitivity kit (Thermo Fisher, Q32854). Fragment size distributions were assessed using the Agilent Fragment Analyzer. Purity measurements were performed using a Thermo Fisher Nanodrop system to check for the absence of contaminants.

Genome sequencing

Generation of HiFi reads using PacBio Sequel II - CCS

Library preparation was performed according to the manufacturer’s instructions “Procedure & Checklist Preparing HiFi SMRTbell Libraries using SMRTbell Express Template Prep Kit 2.0 or 3.0”. 5 to 10 μg of DNA was purified and sheared to a size of 20 kb using the Megaruptor3 system (Diagenode). Size selection with a 10–15 kb cutoff was performed on the BluePippin Size Selection system or the Pippin HT system (Sage Science). Libraries were sequenced on 2 to 4 SMRTcells on a Sequel II instrument with a 2-hour pre-extension and a 30-hour movie, aiming for 25X genome coverage in HiFi reads.
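The relation between read yield and the 25X coverage target can be illustrated with a short calculation. The sketch below is only a planning aid: the function names are hypothetical, and the genome size and read length used in the usage note are round assumed values, not measurements from this dataset.

```python
import math


def hifi_coverage(n_reads: int, mean_read_len_bp: float, genome_size_bp: float) -> float:
    """Expected fold coverage = total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return n_reads * mean_read_len_bp / genome_size_bp


def reads_needed(target_coverage: float, mean_read_len_bp: float, genome_size_bp: float) -> int:
    """Smallest number of reads whose total length reaches the target coverage."""
    if mean_read_len_bp <= 0:
        raise ValueError("mean_read_len_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / mean_read_len_bp)
```

For example, under the assumption of a 2.4 Gb genome and a 20 kb mean read length, `reads_needed(25, 20_000, 2.4e9)` gives 3,000,000 reads, and `hifi_coverage(3_000_000, 20_000, 2.4e9)` gives 25.0.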

Hi-C library preparation and sequencing

Hi-C libraries were prepared from the F2, F4, F252 and MBS847 samples, using isolated nuclei as starting material. The nuclei were obtained from 1 g of young leaves, following the method described in Workman et al.19. All nuclei obtained were then fixed in 1.5% formaldehyde and used to perform Hi-C using the Dovetail Hi-C Kit according to the manufacturer’s protocol (Ref: DG-HiC). Briefly, fixed in situ chromatin was digested with DpnII, DNA ends were labeled with biotin and proximity ligation was performed. After reverse-crosslinking, 1 μg of purified DNA was sheared to a mean fragment size of ~550 bp (Covaris) and used to build a sequencing library using Illumina adapters. Biotin-containing fragments were isolated using M280 streptavidin Dynabeads (Invitrogen) before PCR enrichment of the library (10 PCR cycles). The libraries were sequenced on an Illumina NovaSeq6000 platform to generate 2 × 150 bp paired-end reads, producing a minimum of 48 Gb of Hi-C read data per library.

Genome sequence assembly and validation

Genome sequence assemblies were performed in two consecutive steps, first building contigs from HiFi reads, then organizing these contigs into chromosomes. For a first set of 4 lines, contigs were scaffolded using Hi-C data. These lines were chosen to represent material with varying degrees of relatedness to B73: two non-Stiff Stalk lines belonging to two different subgroups (F252 and MBS847), and two flint lines representing European flints (F2) and Northern flints (F4). We observed no major rearrangements as compared to B73 for any of the assembled genome sequences (see Supplementary Fig. 1 for a genome comparison illustration using D-GENIES20), and all these were included within contigs. This indicates that our contigs were long enough to ensure good scaffolding using B73 as a reference. We therefore generated reference-guided assemblies for all other inbred lines using the B73v5 sequence as reference.

Contig assembly

HiFi reads were assembled in contigs with hifiasm21 version 0.16.1 using default parameters. Contig assembly metrics were generated using the assemblathon_stats.pl script found at https://github.com/KorfLab/Assemblathon.
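Contiguity metrics such as N50 and L50 follow directly from the list of contig lengths. The sketch below illustrates the usual definition; it is not the assemblathon_stats.pl implementation, whose handling of edge cases may differ, and the function name is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the assembly;
    L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, rank
```

With toy lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` (total 54), the three longest sequences sum to 27, so the function returns `(8, 3)`.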

Contig scaffolding

For F2, F4, F252 and MBS847 lines, Illumina Hi-C reads were aligned onto the contigs with Juicer22, and contigs were scaffolded with 3D-DNA23. Resulting contact maps were manually corrected with Juicebox24. For all three software packages, default parameters were used. Read quantity, read coverage and Hi-C link metrics are presented in Table 6. For all other maize lines, contig sets were scaffolded with ragtag25 version 2.0.1 using default parameters, using the Zm-B73-REFERENCE-NAN-5.0.fa sequence as reference, downloaded from the NCBI website https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_902167145.1/. For each maize line, contigs were organized into 10 pseudo-chromosomes, with unplaced contigs corresponding to only 0.9 to 7.2% of the assembly total length.

Scaffold validation

Scaffold metrics were produced using the assemblathon_stats.pl script26. Gene-space completeness was assessed with BUSCO (Benchmarking Universal Single-Copy Orthologs)27 version 5.1.2 using the poales_odb10 lineage. K-mer completeness and sequence quality value (QV) of the scaffolds were assessed using Merqury28 version 1.3 with default parameters.
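To help interpret the quality values, the sketch below shows how a Phred-scaled QV relates to a per-base error rate, and how a k-mer based QV can be derived from k-mer counts following the formulation described in the Merqury publication. The function names and the k-mer counts in the usage note are illustrative assumptions; the k-mer size actually used is not stated here.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def kmer_qv(asm_only_kmers: int, total_asm_kmers: int, k: int) -> float:
    """Phred-scaled consensus quality estimated from k-mer counts.

    asm_only_kmers: k-mers present in the assembly but absent from the reads.
    total_asm_kmers: all k-mers in the assembly.
    The per-base error rate is 1 - (1 - asm_only/total) ** (1/k).
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("invalid k-mer counts")
    if asm_only_kmers == 0:
        return math.inf  # no erroneous k-mers observed
    if asm_only_kmers == total_asm_kmers:
        return 0.0  # every k-mer is unsupported
    ratio = asm_only_kmers / total_asm_kmers
    # log1p/expm1 keep precision when the ratio is very small
    error = -math.expm1(math.log1p(-ratio) / k)
    return -10 * math.log10(error)
```

A QV of 60 corresponds to about one error per million bases. As an assumed example, 21 unsupported k-mers out of 1,000,000 with k = 21 yields a QV close to 60.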

SNPs and structural variants detection

SNPs and structural variants were detected from the raw HiFi reads by aligning the fastq reads from each maize line to the maize reference assembly B73_RefGen_v4 using pbmm2 (https://github.com/PacificBiosciences/pbmm2) with the CCS preset. SNPs were detected using DeepVariant (1.3.0) with default parameters (see snp_detection rules in https://github.com/SeqOccin-SV/SeqOccinVariants). Structural variants were detected using Sniffles29 (https://github.com/fritzsedlazeck/Sniffles) in a two-round process. Sniffles was first used to detect variants on an individual basis with the following parameters (--minsupport 12 --minsvlen 100 --max-splits-base 2 --max-splits-kb 0 --min-alignment-length 5000 --minsvlen 20), with default values for the other parameters. The resulting VCF files were filtered to keep only variants with the PASS filter and merged using the Jasmine software30. BND (breakend) and TRA (translocation) variants were filtered out, and the merged SVs were provided as input (--genotype-vcf) to Sniffles along with the BAM file of each individual line, yielding a set of SVs genotyped in all individuals (see Fig. 1).
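The two filtering criteria described above (PASS records only, no BND or TRA variants) can be illustrated with a small VCF filter. This is a simplified sketch, not the code used in the SeqOccinVariants workflow: it applies both criteria in a single pass, whereas the pipeline applies the PASS filter before merging and the type filter after, and the function names are hypothetical.

```python
EXCLUDED_SVTYPES = {"BND", "TRA"}


def svtype_of(info: str):
    """Return the SVTYPE value from a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        if field.startswith("SVTYPE="):
            return field.split("=", 1)[1]
    return None


def filter_sv_records(lines):
    """Yield header lines unchanged and only the records that pass both filters.

    A record is kept when its FILTER column (7th) is exactly PASS and its
    SVTYPE (from the 8th, INFO, column) is not a breakend or translocation.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # malformed record, skipped in this sketch
        if cols[6] != "PASS":
            continue
        if svtype_of(cols[7]) in EXCLUDED_SVTYPES:
            continue
        yield line
```

The generator can be applied to any iterable of VCF lines, for example an open file handle, and its output written back line by line.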

Fig. 1

Quantities of structural variants detected for each inbred line as compared to B73.

Data Records

Reads and assembled genome sequences were deposited in the European Nucleotide Archive under bioproject PRJEB6781231 (see Tables 2–6 for details). SNPs and structural variants data were deposited in the European Variant Archive (Study ID: PRJEB106599)32 and in the ‘Recherche Data Gouv’ repository: https://doi.org/10.57745/7AUTOL33.

Table 2 Read sets accessions and statistics.
Table 3 Genome assembly: contig metrics.
Table 4 Genome assembly: scaffold metrics.
Table 5 BUSCO and merqury scores.
Table 6 Hi-C metrics. Cov.: coverage, V.i.: Valid interaction.

Technical Validation

We produced about 2.1 to 6.9 million reads per maize line, with an average read length ranging from 12 kb to 22 kb (Table 2). These high-quality HiFi reads were first used to assemble the genomes into contigs, with the number of contigs per maize line ranging from 260 to 3084 (average 1221.1, see Table 3) and contig N50 lengths ranging from 11.8 Mb to 166.0 Mb (average 87.1 Mb, see Table 3). For each maize line, chromosome-scale scaffolds were obtained, with the cumulative size of assembled chromosomes ranging from 2.18 Gb to 2.35 Gb (Table 4), in line with the genome sizes expected for maize. As anticipated, tropical lines had larger genome sizes (2.32 Gb) than temperate lines (2.25 Gb). Scaffold N50s range from 219.5 Mb to 253.8 Mb, with L50 from 4 to 5 (Table 4). To ensure the quality, integrity and accuracy of the assembled chromosome sequences, we carried out several validation approaches.

Completeness of genome assemblies was evaluated using BUSCO version 5.1.2 with the poales_odb10 lineage containing 4,896 proteins, as well as with Merqury version 1.3. Metrics per genome assembly are presented in Table 5. For all assemblies, >97% of the BUSCO proteins were complete. Merqury results showed assembly quality values >60 and completeness >96.62%.

To further validate the quality of the genome assemblies generated and the genotypes of the DNA sequenced, we investigated the polymorphisms (SNPs, indels and structural variants >50 bp) of each line relative to the reference line B73. As expected, the number of variants reflected the genetic distances of maize lines from B73 (Fig. 1). Stiff Stalk Synthetic lines showed the lowest number of variants (7,290,142 SNPs, 829,336 indels and 68,850 SVs, Supplementary Table 2), with the lowest numbers found for lines of the B73 subgroup (Fig. 1 and Table 7). In contrast, flint lines showed the highest number of variants (14,901,375 SNPs, 1,490,896 indels and 119,558 SVs) (Supplementary Table 1). Lancaster and Iodent lines had intermediate values, with Lancaster lines having slightly more variants (12,365,784 SNPs, 1,282,139 indels and 107,607 SVs) than Iodent lines (11,995,606 SNPs, 1,257,735 indels and 105,935 SVs) (Supplementary Table 2). Lines of tropical origin showed slightly fewer variants than flint lines. Finally, a PCA based on the SNPs recapitulated the genetic groups and relationships among the lines (Fig. 2). Altogether, these results indicate the high quality of the sequences generated and the reliability of the seed lots sequenced. They also highlight the relevance of our dataset to improve knowledge on maize structural diversity, and the importance of including flint lines in sequencing programs to leverage the maize pangenome.

Table 7 Number of variants detected for each maize line as compared to inbred line B73.
Fig. 2

PCA constructed from the (standardized) genotypes of 1 million randomly selected SNPs.
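The principle of a PCA on standardized genotypes can be sketched in a few lines of dependency-free Python. This is an illustration only, not the software used to produce Fig. 2: it assumes genotypes coded as 0/1/2 allele counts, extracts only the first component by power iteration, and uses hypothetical function names. A real analysis of a million SNPs would rely on an optimized numerical library.

```python
import math
import random


def standardize(genotypes):
    """Center each SNP to mean 0 and scale it to unit variance.

    genotypes: list of samples, each a list of allele counts (0, 1 or 2).
    Monomorphic SNPs carry no information and are dropped.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var == 0:
            continue
        sd = math.sqrt(var)
        columns.append([(x - mean) / sd for x in col])
    return [[c[i] for c in columns] for i in range(n)]


def first_pc(z, iterations=200, seed=0):
    """Unit-length sample coordinates on PC1 (sign is arbitrary).

    Uses power iteration on the sample-by-sample Gram matrix Z Z^T,
    whose leading eigenvector is proportional to the PC1 scores.
    """
    n = len(z)
    gram = [[sum(a * b for a, b in zip(z[i], z[j])) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

On a toy matrix with two genetically distinct pairs of lines, samples from the same pair receive PC1 coordinates of the same sign and the two pairs fall on opposite sides of zero.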