Background & Summary

The Brassica genus contains several crop species cultivated as oilseeds, vegetables and condiments. It belongs to the Brassicaceae family, which comprises approximately 4,000 species and 350 genera1,2. B. oleracea and B. rapa are two agronomically important species that are primarily grown as vegetables. These two species diverged from a common ancestor about 4 million years ago. Both species present considerable phenotypic diversity which has been subjected to independent selection, giving rise to numerous morphotypes3,4,5,6. For example, different varieties of B. oleracea account for broccoli, brussels sprouts, cabbage, cauliflower and kale, among others, while B. rapa is grown as Chinese cabbage, pak choi, mizuna and turnip5.

B. oleracea and B. rapa are the diploid progenitors of B. napus, which was formed through their interspecific hybridization and genome doubling7. B. napus, also known as oilseed rape or canola, is one of the most economically important oilseed crops, processed into edible and industrial oil8. However, due to its polyploid origin and extensive human selection, primarily for seed quality traits, its genetic diversity has been severely eroded9. As a result, the diversity present in its parental diploid progenitors is often exploited and introgressed into B. napus to introduce agronomically favorable genes, including disease resistance (R) genes10,11,12,13.

With the recent advent of third-generation sequencing technologies, it is now possible to assemble plant genomes at the chromosome level, facilitating the identification and cloning of genes, including causal R genes, for example for blackleg14,15, clubroot16 and Sclerotinia stem rot17 in Brassica. However, assemblies of only a few individuals cannot capture species-wide genetic diversity, which highlights the inadequacy of single reference genomes and the need to construct pangenomes. Analysis of the B. oleracea and B. rapa pangenomes revealed that resistance gene analogs (RGAs) are highly affected by presence-absence variation, with 12% and 30% of RGAs falling in the variable genomes of B. oleracea and B. rapa, respectively18,19. Having multiple genome assemblies that capture the diversity between accessions and morphotypes is therefore crucial for expanding the repertoire of known RGAs, which underpins the identification of functional R genes.

Here, we report the construction of high-quality chromosome level genome assemblies for three B. oleracea accessions: B. oleracea ssp. acephala cv. C102, B. oleracea ssp. botrytis cv. Nd125, and a wild type B. oleracea individual ‘Bos01’ from Le Hode (Normandy, France). We also improved the genome assemblies of two previously published B. rapa accessions: B. rapa ssp. narinosa cv. Wutacai4 and B. rapa ssp. trilocularis cv. R50020. When compared with previous versions of the same accession or other accessions of the same morphotype, our assemblies contain, on average, over 13,000 additional gene annotations which include novel RGAs. These assemblies provide a valuable resource for the exploration of novel RGAs which have the potential to contribute toward the improvement of disease resistance in Brassica crops.

Methods

Plant material, DNA extraction, sequencing

One individual from each of two B. rapa (B. rapa ssp. narinosa cv. Wutacai and B. rapa ssp. trilocularis cv. R500) and three B. oleracea accessions (B. oleracea ssp. acephala cv. C102, B. oleracea ssp. botrytis cv. Nd125, and a wild type B. oleracea ‘Bos01’ individual from Le Hode, Normandy, France) was grown in a greenhouse (16 h of light at 21 °C followed by 8 h of dark at 18 °C). Plants were grown in pots filled with a non-fertilized commercial substrate (Falienor, reference 922016F3) and irrigated twice a week with a commercial fertilized solution (Liquoplant Blue, 2.5% nitrogen, 5% phosphorus, 2.5% potassium, w/v). These accessions were retrieved from the BraCySol Biological Resource Center (https://eng-igepp.rennes.hub.inrae.fr/about-igepp/platforms/bracysol). The collected plant materials were flash frozen and stored at −80 °C. High-quality high-molecular weight (HMW) DNA was generated for each accession from 1 g of young leaves of a single individual using a CTAB extraction followed by purification with the commercial Qiagen Genomic-tip (QIAGEN, Germantown, MD, USA), as previously described21. HMW gDNA quality was checked on a FemtoPulse system (Agilent), which showed DNA molecules longer than 40 kb. For each accession, a library was prepared using the Native Barcoding Kit 24 V14 - Ligation sequencing gDNA (SQK-NBD114.24). HMW DNA libraries were sequenced on PromethION flow cells. In addition, Illumina DNA sequencing was performed using a NovaSeq 6000 (2 × 150 bp paired-end reads). Raw DNA-Seq data are available on ENA: PRJEB91561 (B. oleracea ‘Bos01’), PRJEB91565 (B. oleracea cv. C102), PRJEB91569 (B. oleracea cv. Nd125), PRJEB91574 (B. rapa cv. R500), PRJEB91578 (B. rapa cv. Wutacai)22.

De novo assemblies of chromosome level nuclear genomes

The two B. rapa and three B. oleracea nuclear genomes were assembled using the Genoscope GALOP pipeline (https://workflowhub.eu/workflows/1200). Briefly, raw Nanopore reads were assembled using NextDenovo v2.5.1 (Nextomics, https://github.com/Nextomics/NextDenovo). The resulting contigs were first polished with Medaka v1.7.2 (https://github.com/nanoporetech/medaka) using default parameters and Nanopore long reads. These contigs were then further polished with two rounds of Hapo-G v1.123, using Illumina short reads and default parameters. They were finally scaffolded using Ragtag v2.1.024, with either the B. rapa Z1 v225 or the B. oleracea cv. Korso genome26 as the reference, depending on the species. A schematic diagram summarizing the workflow used for de novo assembly of the nuclear genomes is presented in Figure S1.

RNA extraction, sequencing and gene prediction

To aid gene prediction, Illumina RNA-Seq data were obtained for each accession from different organs harvested from the same plant (the one used for DNA sequencing) at different developmental stages. More precisely, we harvested leaves, roots and stems from plants at the 4–6 leaf stage, and flower buds from mature plants. The different organs were first harvested separately and flash frozen. They were then ground into a fine powder using a mortar and pestle. A similar quantity of powder from the different organs was bulked and used to extract total RNA using the Nucleospin RNA Plus kit (Macherey-Nagel, Germany). The cDNA library was constructed using the NEBNext® Ultra™ RNA Library Prep kit for Illumina (New England Biolabs, USA) and Illumina paired-end sequencing was performed on an Illumina NovaSeq 6000 (Azenta Life Sciences, Germany). Raw RNA-Seq data are available on ENA: PRJEB91561 (B. oleracea ‘Bos01’), PRJEB91565 (B. oleracea cv. C102), PRJEB91569 (B. oleracea cv. Nd125), PRJEB91574 (B. rapa cv. R500), PRJEB91578 (B. rapa cv. Wutacai)22.

Gene prediction was performed using several reference proteomes: eight from other B. napus genotypes (Westar, ZS11, Quinta, Zheyou7, No2127, Gangan, Tapidor and Shengli)27; Arabidopsis thaliana (proteome ID: UP000006548); B. rapa cv. Z125; B. oleracea cv. HDEM28; and B. napus cv. Darmor-bzh29. Regions of low complexity in the genomic sequences were masked using the DustMasker algorithm (version 1.0.0 from the BLAST+ 2.10.0 package)30. Protein sequences were aligned to the genome using a two-step strategy. First, BLAT v3631 was used to rapidly localize putative matches. The best hit and all hits with a score ≥ 90% of the best score were retained. In the second step, alignments were refined using Genewise v2.2.032, which accurately identifies intron-exon boundaries. Alignments were retained if more than 75% of the protein length aligned to the genome. Additionally, RNA-Seq short reads (Illumina) were used for four genomes (R500, Wutacai, C102, and Nd125). Reads were mapped to their respective genomes using HISAT2 v2.2.133 with default parameters. The resulting BAM files were used as input to StringTie v2.2.334, with the --rf option to indicate the orientation of the RNA-Seq libraries. When multiple transcripts were detected for a gene, the most highly expressed one (based on TPM) was selected. GFF files were derived from StringTie outputs to retain only the most highly expressed transcript and to remove single-exon models.
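
The two retention rules applied to the protein alignments can be illustrated with a minimal Python sketch. It is not the pipeline's code: the `Hit` record, its field names and the function names are hypothetical, and the sketch assumes that BLAT scores and aligned lengths have already been parsed into memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one protein-to-genome alignment."""
    protein_id: str
    locus: str        # free-text label for the genomic location
    score: float      # alignment score from the fast first pass
    aligned_len: int  # protein residues aligned after refinement
    protein_len: int  # full protein length


def retain_candidate_hits(hits, score_fraction=0.90):
    """Keep, per protein, the best hit and every hit scoring >= 90% of it."""
    best = {}
    for h in hits:
        best[h.protein_id] = max(best.get(h.protein_id, 0.0), h.score)
    return [h for h in hits if h.score >= score_fraction * best[h.protein_id]]


def passes_coverage(hit, min_coverage=0.75):
    """True when more than 75% of the protein length is aligned."""
    return hit.aligned_len / hit.protein_len > min_coverage


hits = [
    Hit("p1", "A01", 1000.0, 320, 400),
    Hit("p1", "A05", 950.0, 200, 400),
    Hit("p1", "A09", 800.0, 390, 400),
]
candidates = retain_candidate_hits(hits)           # A01 and A05 survive
kept = [h for h in candidates if passes_coverage(h)]  # only A01 survives
```

In this toy input the third hit is dropped by the score rule even though its coverage is high, which shows that the two filters act in sequence rather than jointly.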

All transcriptomic and protein alignments were integrated using Gmove (https://f1000research.com/posters/5-681), an evidence-driven gene predictor requiring no training. Gmove constructs a graph where nodes and edges represent putative exons and introns extracted from alignments, then extracts paths consistent with the protein evidence, identifying open reading frames. Predicted gene models with more than 50% untranslated region (UTR) content and with a coding sequence (CDS) length shorter than 300 nucleotides were discarded. All final gene models were renamed according to the MBGP (Multinational Brassica Genome Project) nomenclature. A schematic diagram summarizing the workflow used for gene prediction is presented in Figure S2. The annotations of the nuclear genes (.gff, mRNA and protein files) are available on the French recherche.data.gouv repository: https://doi.org/10.57745/D21PQM35 (deposited October 2025).
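
The final gene-model filter can be sketched as follows. This is an illustration under stated assumptions, not Gmove's implementation: the `GeneModel` record and the `should_discard` function are hypothetical, UTR content is taken to be the non-CDS share of the exonic transcript length, and because the wording above can be read as requiring either criterion or both, the sketch exposes that choice as a `require_both` parameter instead of asserting one reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical summary of one predicted gene model."""
    gene_id: str
    transcript_len: int  # exonic length (UTR + CDS), in nucleotides
    cds_len: int         # coding sequence length, in nucleotides

    @property
    def utr_fraction(self) -> float:
        # Assumption: UTR content = share of the transcript that is not CDS.
        return (self.transcript_len - self.cds_len) / self.transcript_len


def should_discard(model, max_utr_fraction=0.50, min_cds_len=300,
                   require_both=False):
    """Flag models with > 50% UTR content and/or a CDS shorter than 300 nt."""
    high_utr = model.utr_fraction > max_utr_fraction
    short_cds = model.cds_len < min_cds_len
    if require_both:
        return high_utr and short_cds
    return high_utr or short_cds


models = [
    GeneModel("g1", 1000, 900),  # low UTR share, long CDS
    GeneModel("g2", 1000, 200),  # high UTR share, short CDS
    GeneModel("g3", 2000, 600),  # high UTR share, long CDS
]
retained = [m.gene_id for m in models if not should_discard(m)]
```

Only models such as `g3` are affected by the choice of reading, so comparing the two settings on a real annotation would show how much the ambiguity matters in practice.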

The quality of the genomes was also evaluated based on their Long Terminal Repeat (LTR) composition using LAI version beta3.236. The LTR annotation was performed with LTR_retriever version 3.0.437, which integrated results from both ltrharvest38 (GenomeTools version 1.6.2, using the following parameters: -minlenltr 100, -maxlenltr 7000, -mintsd 4, -maxtsd 6, -motif TGCA, -motifmis 1, -similar 85, -vic 10, -seed 20, -seqids yes) and LTR_finder39 (parallel version 1.3, with the options -harvest_out and -size 1000000). This process followed the recommendations provided at https://github.com/oushujun/LTR_retriever.

Chloroplast genome assemblies

The Illumina DNA-Seq data obtained for each accession were also used to assemble its chloroplast genome. This was performed using FastPlast v1.2.940 (https://github.com/mrmckain/Fast-Plast). The chloroplast genome assemblies were annotated using the online version of GeSeq41. These assembled and annotated genomes were then validated visually using Geneious Prime 2022.2.2 and the Arabidopsis thaliana chloroplast genome (NC_000932)42 as a reference. A graphical representation of these different chloroplast genomes was obtained using the online OGDRAW v1.3.1 (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html)43. Genome assemblies are available on ENA: PRJEB91562 (B. oleracea ‘Bos01’), PRJEB91566 (B. oleracea cv. C102), PRJEB91570 (B. oleracea cv. Nd125), PRJEB91575 (B. rapa cv. R500), PRJEB91579 (B. rapa cv. Wutacai)22. The annotations of the chloroplast genes (.gff) are available on the French recherche.data.gouv repository: https://doi.org/10.57745/RLHXJH44 (deposited October 2025).

Data Records

All sequencing data and associated materials are available in the European Nucleotide Archive (ENA) under project accession number PRJEB9144622. The identifiers for the raw DNA and RNA-Seq data are: PRJEB91561 (B. oleracea ‘Bos01’), PRJEB91565 (B. oleracea cv. C102), PRJEB91569 (B. oleracea cv. Nd125), PRJEB91574 (B. rapa cv. R500), PRJEB91578 (B. rapa cv. Wutacai). The identifiers for the chloroplast genome assemblies are: PRJEB91562 (B. oleracea ‘Bos01’), PRJEB91566 (B. oleracea cv. C102), PRJEB91570 (B. oleracea cv. Nd125), PRJEB91575 (B. rapa cv. R500), PRJEB91579 (B. rapa cv. Wutacai). For each studied genotype, the accessions and their related bioprojects (DNA and RNA-Seq raw data, as well as nuclear and chloroplast genome assemblies) are summarized in Table S1. The annotations of the nuclear genes (.gff, mRNA and protein fasta files) are available on the French recherche.data.gouv repository at https://doi.org/10.57745/D21PQM35 (deposited October 2025). On the same repository, the annotations of the chloroplast genes (.gff) are available at https://doi.org/10.57745/RLHXJH44 (deposited October 2025).

Technical Validation

Evaluation of the de novo assembled genomes

For all five accessions, we obtained genome assemblies ranging from 563 to 646 Mb and from 373 to 404 Mb in B. oleracea and B. rapa, respectively. Overall, 91.9 to 98.7% of the sequences were anchored to pseudomolecules. For the two B. rapa accessions, for which a nuclear genome assembly was previously published4,20, our assemblies are more complete than the previous versions (Table 1), as exemplified by the pseudomolecule sizes. For the B. rapa R500 genome, for which a chromosome-level genome assembly was already available, we used SyRI v1.5.445 to identify the regions that were better assembled in our genome assembly and observed that our updated version was particularly improved in the highly repetitive pericentromeric regions, but also at the beginning of chromosome A01 (Fig. 1). For B. oleracea, we also compared the metrics of our genomes to those obtained in other accessions belonging to the same morphotype (Table 1). The genomes used for these analyses were the following: C-8 v246; Korso v147; 07-DH-33 v1 and W1701 v126; T09 v1, T10 v1, T18 v1, T21 v1 and T25 v148; and W03 v149. For these comparisons, the metrics were not extracted from the original publication (except for R500 as indicated in Table 1) but were calculated from the downloaded FASTA files using an internal tool named fastoche (https://github.com/institut-de-genomique/fastoche). For the B. oleracea genomes, 91.9 to 98.7% of the assembled sequences were anchored to pseudochromosomes.
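
To make the kind of metric reported in Table 1 concrete, the following minimal Python sketch computes total assembly size, N50 and the percentage of sequence anchored to pseudomolecules from FASTA records. It is not the fastoche implementation; the function names are hypothetical, and the rule used to recognise pseudomolecules (sequence names starting with "A" or "C") is an assumption made only for this toy example.

```python
def read_fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the identifier, drop the description
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def anchored_percent(lengths, is_pseudomolecule):
    """Percentage of assembled bases carried by pseudomolecules."""
    total = sum(lengths.values())
    anchored = sum(n for name, n in lengths.items() if is_pseudomolecule(name))
    return 100.0 * anchored / total if total else 0.0


# Toy input standing in for a downloaded FASTA file.
fasta = [">A01 chromosome", "ACGTACGT", "ACGT", ">A02", "ACGTAC", ">scaffold_1", "AC"]
lengths = read_fasta_lengths(fasta)
total_size = sum(lengths.values())
assembly_n50 = n50(lengths.values())
# Assumed naming rule for pseudomolecules, used only in this example.
pct_anchored = anchored_percent(lengths, lambda name: name[0] in "AC")
```

For a real assembly, the list of lines would be replaced by an open file handle, and the naming rule by whatever convention the assembly uses for its chromosomes.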

Table 1 Summary of genome assembly metrics for the five new assemblies reported in this study and previous assemblies of the same accession (B. rapa) or different accessions of the same morphotype (B. oleracea).
Fig. 1 Comparison using SyRI45 of the new (this study) and previously published20 B. rapa ssp. trilocularis cv. R500 pseudomolecules.

Taking advantage of RNA-Seq data obtained from various organs for each accession, we predicted protein-coding genes, and observed 66,049 to 71,867 and 52,275 to 54,266 protein coding sequences in our B. oleracea and B. rapa genotypes, respectively. Importantly, 94.97 to 99.49% of these predicted genes are anchored on pseudomolecules.

Our five newly assembled nuclear genomes were also validated by running BUSCO v5.8.250 with the brassicales_odb12 dataset (July 2025), which revealed a gene completeness of 99.3 to 99.4%. The high quality of our genome assemblies is also reflected in the values obtained for the LTR Assembly Index (values above 10, as expected for reference quality genomes). No score could be computed for the first genome assembly of B. rapa cv. R50020, as the total and intact LTR sequence content was too low in this first draft assembly for accurate LAI calculation.

Using the Illumina DNA-Seq data obtained for our five genotypes, we also assembled, annotated, and graphically represented their chloroplast genomes using FastPlast v1.2.840, as well as GeSeq41 and OGDRAW43 online versions (example in Fig. 2). Their size ranged from 153,364 to 153,365 bp and from 153,036 to 153,464 bp in B. oleracea and B. rapa, respectively.

Fig. 2 Graphical representation of the wild B. oleracea ‘Bos01’ assembled and annotated chloroplast genome.

Evaluating and expanding the lists of resistance gene analogs

To evaluate the quality of our nuclear genome assemblies and their utility compared to previous versions or assemblies from other accessions belonging to the same morphotype, we explored their RGA content using RGAugury v2.1.7m51. We predicted 1,382 to 1,411 and 1,703 to 1,912 RGAs in our B. rapa and B. oleracea genotypes, respectively (Table 2, see https://doi.org/10.57745/O2FUQ8 for a list of RGA genes identified in each genome). To explore the variability of the RGA content between assemblies, we used OrthoFinder v3.0.152. OrthoFinder analysis identified at least 122 novel RGAs in Nd125, Bos01 and C102, when compared to other accessions from the same morphotype (Table 3, orthogroups from these analyses can be obtained at https://doi.org/10.57745/O2FUQ853). Additionally, over 200 novel RGAs were identified in both Wutacai and R500 (this study) when compared to their previous versions (Table 3, lists of all the unique RGAs for each newly assembled genome are available at https://doi.org/10.57745/O2FUQ853). For the new R500 assembly, we used phenogram54 to plot the distribution of the different types of RGAs along each chromosome, allowing the identification of RGA clusters in different chromosomal regions (Fig. 3a). We also plotted only the newly identified RGAs in R500, highlighting that these genes are distributed on all chromosomes but are overrepresented at the beginning of chromosome A01 (Fig. 3b), which was not assembled in the previous version (Fig. 1).
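
One way such a comparison could be carried out is sketched below in Python. The exact criterion used to call an RGA novel is not detailed here, so the sketch assumes a simple one: an RGA of the new assembly counts as unique when its orthogroup contains no gene from any comparison assembly, or when it was not placed in any orthogroup. The function name, the in-memory layout of the orthogroups and all identifiers are hypothetical.

```python
def unique_rgas(orthogroups, target, comparisons, rga_ids):
    """Return RGAs of `target` with no orthologue in any comparison genome.

    orthogroups: {orthogroup id: {genome name: [gene ids]}}
    target:      name of the new assembly
    comparisons: names of the assemblies it is compared against
    rga_ids:     set of gene ids predicted as RGAs in the target
    """
    unique = set()
    assigned = set()
    for members in orthogroups.values():
        target_genes = set(members.get(target, ()))
        assigned |= target_genes
        # The orthogroup is private to the target when no comparison genome has a gene in it.
        if not any(members.get(genome) for genome in comparisons):
            unique |= target_genes & rga_ids
    # RGAs never placed in an orthogroup have no detected orthologue either.
    unique |= rga_ids - assigned
    return unique


orthogroups = {
    "OG1": {"new": ["n1"], "old": ["o1"]},   # shared with the comparison genome
    "OG2": {"new": ["n2", "n3"]},            # private to the new assembly
    "OG3": {"new": ["n4"], "old": []},       # comparison column present but empty
}
rgas_in_new = {"n1", "n2", "n4", "n5"}       # n5 is in no orthogroup
novel = unique_rgas(orthogroups, "new", ["old"], rgas_in_new)
```

Under this criterion a gene such as `n3` is ignored because it is not an RGA, while `n5` is counted because it has no orthogroup at all; a stricter or looser definition would change the counts.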

Table 2 Number and class of resistance gene analogs (RGAs) predicted in the five new genomes assembled in this study and comparisons with previous versions of the same accession (B. rapa) or different accessions of the same morphotype (B. oleracea).
Table 3 The number of unique resistance gene analogs (RGAs) in the five new genome assemblies when compared to previous versions of the same accession (B. rapa) or different accessions of the same morphotype (B. oleracea).
Fig. 3 The distribution of resistance gene analogs (RGAs) across the new (this study) B. rapa ssp. trilocularis cv. R500 pseudomolecules. The plots include (a) all RGAs or (b) RGAs that are unique when compared with the previously published20 R500 assembly.