Main

Pea is an Old World crop that was first cultivated about 10,000 years ago in the Fertile Crescent8. About three quarters of the cultivated area is for dry seed, and the remaining quarter is for use as a vegetable, with a total export value of about US $3 billion in export value in 2022 (https://www.fao.org/faostat/en/#data/). Pea also has a minor use as a fodder crop and is often grown in home gardens. The nutritional and environmental advantages of this pulse crop, including the rising popularity of pea protein and its nitrogen-fixing root nodule symbiosis, have been discussed elsewhere9,10.

Pea is predominantly inbreeding, with large flowers and many easily distinguishable characteristics, making it an ideal model for Mendel’s studies of inheritance1,11. The seven variants that Mendel studied in detail were clearly distinguished in the seed catalogues of the time12, representing different agronomic forms, end uses or market types, as they still do today. Mendel’s work on peas was described by Allan Franklin as “The best experiments ever done”13. In addition to its important historical contribution to the development of genetics, pea continues to serve as an excellent plant model system, with approximately 100 pea genes characterized at the molecular level14. However, much remains unknown about the molecular nature of the seven pairs of contrasting traits that Mendel studied, even though the genetic loci were named more than a century ago15. The four cloned genes R (seed shape), Le (plant height), I (seed colour) and A (flower colour) have been characterized for some time2,3,4,5,6,7,16, but the extent and distribution of their natural allelic variation remains poorly understood in a genomic context14,17,18. The gene identities of Mendel’s three other traits, P and V (pod form), Gp (pod colour) and Fa (fasciation), are uncharacterized. Candidates for Gp and P have been proposed on the basis of specific genome-wide association studies (GWASs) and bi-parental mapping studies19,20,21; however, further work was needed to confirm or reject these proposals.

Here we present a sequence-based population genomic map and a trait-based phenotypic variation map, coupled with haplotype–phenotype association analyses across a wide range of traits in one of the world’s major Pisum germplasm collections14. We focus on elucidating the gene identities and genomic context of alleles underlying Mendel’s seven well-known traits. We further demonstrate how this approach can be expanded to uncover the molecular basis of a wide range of agronomic and horticultural characters with breeding potential.

Variation map of a Pisum collection

It is widely accepted that Pisum has two wild species Pisum fulvum and Pisum elatius and two independently domesticated taxa Pisum abyssinicum and P. sativum (Supplementary Table 1). We selected a core diversity panel from the widely used John Innes Pisum Germplasm Collection14. This panel includes 500 accessions, selected using Corehunter 3 on the basis of prior genotyping data22,23. The selection was further supplemented with 130 lines previously chosen for other diversity studies (https://pcgin.org/), parents of mutant and mapping populations, and 67 lines comprising selected P. abyssinicum accessions and all accessions designated P. elatius, P. fulvum or Pisum humile (Fig. 1a). We conducted next-generation short-read whole-genome resequencing for these 697 Pisum accessions, generating approximately 80 Gb of clean reads with an average coverage of around 20× for each accession (Supplementary Table 2). We then built a genomic variation map encompassing 154.8 million high-quality single-nucleotide polymorphisms (SNPs) with respect to the ZW6 assembly18, as well as to the inbred pea cultivar Caméor version 1a17 (Supplementary Tables 3 and 4 and Supplementary Fig. 1). Population genomic diversity analysis reveals the pattern of accession relationships and population structure at high resolution23 (Supplementary Tables 5 and 6). We identify eight major Pisum groups (G1–G8) (Fig. 1b,d), with P. fulvum and P. abyssinicum forming distinct branches within G8. The remaining G8 accessions consist mainly of P. elatius, and G7 includes P.sativum together with other sub-types. P. sativum accessions are widely distributed across the groups G1 to G7. These accessions do not have a simple tree-like relationship, but instead form a reticulated network, reflecting the complex history of crossing and introgression between subgroups (Fig. 1e).

Fig. 1: Genotypic and phenotypic variation with respect to population and genome structure within Pisum.
figure 1

a, Taxa types and other classifications indicated by colour on the right, including wild taxa (P. fulvum, P. elatius and other wild taxa listed in Supplementary Tables 1 and 5) and domesticated taxa (P. abyssinicum and P. sativum), further divided into cultivars, landraces and other, which mostly comprises genetic stocks. The number in brackets denotes the number of accessions for each classification. b, Admixture analyses at K = 3 (average of 5 runs), K = 5 (average of 3 runs) and K = 8 (one run that splits K = 5 groups). Accessions strongly assigned to admixture groups are colour-coded, with grey indicating admixture (see Supplementary Table 5). c, Distribution of phenotypes for Mendel’s seven pea traits, with initials labelled as follows: R (round, pale) versus W (wrinkled, black), seed shape; Y (yellow) versus G (green), cotyledon colour; P (pigmented, purple) versus W (white, pale), flower colour; I (inflated, pale) versus C (constricted, black), pod shape; G (green) versus Y (yellow), pod colour; A (axial, pale) versus T (terminal, black), flower position; and T (tall) versus D (dwarf), internode length. The bar length is proportional to internode length. d, Principal component analysis (PCA) of PLINK distance matrix for all accessions, with accessions having Q value > 0.75 colour-coded. e, Splits Tree61 analysis of accessions with Q value > 0.75, with colours matching PCA groups. f, Pisum genomic variation map across all seven chromosomes, including SNPs, insertions and deletions (<50 bp), large-scale structural variations (SVs) and the linkage disequilibrium (LD)-based haplotype map.

Source data

Within the diversity panel, we recorded phenotypic variation for each of Mendel’s seven pairs of contrasting traits (Fig. 1c and Supplementary Table 7). The association of this phenotypic diversity with genomic data (Fig. 1f and Supplementary Tables 812), coupled with linkage analysis from bi-parental mapping populations and gene functional investigations, elucidated the genetic architecture and allelic variants associated with each of these traits (Fig. 2 and Extended Data Fig. 1). We show that, for each trait, a small number of genetic loci contribute to the trait variation.

Fig. 2: Genetic architecture and genomic diversity of the genes underlying the seven pairs of contrasting traits that Mendel studied in detail.
figure 2

a, Images of the contrasting phenotypes of the seven traits. b, Manhattan plots from the GWAS showing the genomic regions with strong peaks associated with phenotypic differences of each trait as scored in this study and plotted against the ZW6 assembly. c, Gene models for R, I, A, Le, P and Fa and associated genomic regions (Gp and V), depicting the wild-type and natural mutant alleles underlying each of the seven traits. Text and illustrations marked in red indicate those identified in this study. Details are provided in the text and Supplementary Information. aa, amino acids.

Novel alleles in four characterized genes

We confirm the previous interpretations for the four characterized Mendelian genes (R, I, Le and A)2,3,4,5,6,7 and identify novel alleles (Fig. 2 and Supplementary Tables 1316). For the R gene, explaining the round versus wrinkled seed phenotype, we detect a single strong, broad signal at the expected genomic position of PsSBEI2 and present the full-length sequence of the Ips-r element inserted in the last exon (Supplementary Fig. 2). R versus r is also the primary determinant of seed protein content variation (Supplementary Fig. 3), suggesting that altered seed starch quality and quantity indirectly affect the ratio of storage protein classes and overall protein content24. For I versus i4,5,6, which corresponds to yellow versus green cotyledons, we characterize two key alleles: i-1 (the insertion of a 5,696-nt TAR element, a Ty1-Copia long terminal repeat (LTR) retrotransposon) and i-2 (a novel 408-bp deletion in the promoter) of the Mg-dechelatase gene (Supplementary Figs. 4 and 5). The phenotypic variability associated with these alleles, driven by transcriptional differences and confounding genetic and environmental factors is described in Supplementary Figs. 6 and 7. For the A gene, which encodes a bHLH transcription factor that is responsible for the presence or absence of anthocyanin pigmentation (A versus a)7, we present several novel alleles (Supplementary Fig. 8), including a rare intragenic suppressor mutation that restores the most abundant a allele to a functional wild-type A allele (Supplementary Fig. 9). Finally, for the Le gene, which encodes GA 3-oxidase1 (also known as GA 3β-hydroxylase), we observe five haplotypes (Supplementary Fig. 10), but the reduced-height le variants were exclusively found in haplotype 1, which carries the known G > A substitution at chromosome (chr.) 5:639,901,919.

Three previously uncharacterized traits

We investigated the gene identities and allelic variants underlying three of Mendel’s seven traits that remained poorly characterized16: ‘the difference in the colour of the unripe pod’ (Gp); ‘the difference in the shape of the ripe pod’ (conditioned by either of two loci, P or V); and ‘the difference in the position of the flowers’ (thought to be conditioned by either of two loci, Fa or Fas).

Pod colour

Although Gp is usually discussed in relation to pod colour, Mendel noted that this is just one feature of the gp mutant1. In mature flowering and fruiting plants, yellow tissues are seen in the petiole, rachis, tendrils and leaflet midribs of young leaves, as well as in the pedicel, peduncle and sepals (Fig. 3a and Supplementary Fig. 11). There are also substantial differences in chloroplast development25, as well as in the physiological and biochemical properties of pod and leaf tissues between green (Gp/Gp) and yellow-podded (gp/gp) varieties (Fig. 3b,c). Here we found that even the green leaves of gp lines show disturbed development of thylakoid membranes (Supplementary Fig. 12), which correlates with a productivity difference between Gp and gp isolines (Supplementary Fig. 13).

Fig. 3: Characterization of the gp mutant.
figure 3

a, General view of near-isogenic plants (BC6 S1 generation from the cross JI0015 gp/gp x Caméor Gp/Gp) developed in this study. Pot diameter is 9 cm. b, Transmission electron microscopic (TEM) sections of pod mesocarp cells. Scale bars, 1 μm. c, TEM sections of leaflet spongy mesophyll cells. Note the poorly developed thylakoid membranes (arrows) in gp compared to Gp. Scale bars, 0.5 μm. d, An approximately 100-kb genomic deletion adjacent to ChlG is illustrated for gp compared to the reference genome ZW6 (Gp). The deletion event in gp lines is illustrated on the Gp reference genome by the dashed box. The approximately 100-kb deletion event was called according to genome assembly comparisons between JI0015 and JI2822 (ref. 62). Far right, expression of transcript types T0 and T1–T5 (Gp/Gp wild type; top right) and t0 and t1–t6 (gp/gp mutant; bottom right); numbers at ends of (or in) bars indicate the expression (in transcripts per million (TPM)). More details are provided in Supplementary Figs. 1519 and Supplementary Tables 2327. e, Crossing scheme for a complementation test between Caméor M4 TILLING line 411.1 carrying one lethal allele of ChlG and gp (JI0015), with the two types of expected F1 genotype. ChlGWT and ChlGW121* represent the wild-type and TILLING alleles of ChlG. WT represents the presence of the wild-type (Caméor) sequence between ChlG and the TIR-NBS-LRR gene, and Δgp represents the approximately 100-kb deletion, which co-segregates with gp. The question being addressed is whether ChlGW121*-WT complements gp (ChlGWT-Δgp). f, F1 pods segregating for green versus yellow. The number after the underscore is the plant number; the parental lines (TILL_6 het and JI0015) and wild-type Caméor are also shown. g, Codominant PCR marker test confirming that all plants presumed to be F1 are Gp/gp heterozygotes (top) and a dCAPS marker PCR test confirming that only the yellow-podded F1 plants inherited the ChlGW121* TILLING allele (bottom). M, DNA size marker (0.5 – 3 kb; 100 bp ladder lane from 1 kb to 0.5 kb and below).

All yellow-podded lines in the John Innes Pisum germplasm collection were found to be allelic to gp, confirming that there is only one known yellow pod locus and here we show that there is only one gp allele. Genetic mapping and association genomics analysis found that all yellow-podded lines carried a deletion of around 100 kb within the GWAS interval that co-segregated with gp (Supplementary Figs. 14 and 15 and Supplementary Tables 1722). With respect to the ZW6 assembly, this large genomic deletion removes one Ogre LTR retrotransposon and an open reading frame encoding a partial NBS-LRR gene, as well as part of exon 5 and the whole of exons 6 and 7 from a gene encoding a TIR-NBS-LRR protein (NLR (also known as Psat03G0414100)) (Fig. 3d).

This deletion is adjacent to the chlorophyll synthase gene (ChlG (also known as Psat03G0413700)), but the ChlG gene is structurally intact in all gp lines, with its encoded amino acid sequence being identical to the wild type. Mapping RNA-sequencing (RNA-seq) reads of the gp lines JI0015 and JI2366 to their matched genome assemblies predicted novel transcripts and isoform variants, including alternative splicing within the NLR–ChlG region (Fig. 3d and Supplementary Fig. 16). Transcriptional fusions, generated by intergenic splicing between the truncated NLR and ChlG, breakpoint read-through of the truncated NLR transcript, and intron read-through of ChlG transcripts were confirmed by PCR with reverse transcription and cDNA sequencing (Supplementary Fig. 17 and Supplementary Table 2325). It can be predicted that the encoded fusion or truncated ChlG proteins would not participate in chlorophyll biosynthesis, nor would they insert into the thylakoid membrane owing to either internalization or the lack of the transit peptide (Supplementary Fig. 18). RNA-seq and quantitative PCR (qPCR) data showed that in gp pods, functional ChlG transcript abundance was reduced to about 6% of that in the wild type (Fig. 3d). In gp mutants, the intact t0 transcript was more abundant in leaves compared with pods but the fused NLR–ChlG t1 and t2 transcript levels were similar in pods and leaves (Supplementary Fig. 19 and Supplementary Tables 26 and 27). We propose that transcriptional interference and aberrant transcripts disrupt ChlG function and chlorophyll synthesis, causing the yellowness of otherwise green tissues in the gp mutant.

To test whether Gp corresponds to ChlG, we obtained a TILLING (targeting-induced local lesions in genomes) mutant26 with a premature stop codon (W121*) in ChlG (Fig. 3e). This mutant could not be recovered as a homozygote, although the mutant allele was transmitted through both pollen and egg cells. We conclude that the homozygous mutation is embryo lethal, but not lethal in either gametophyte. We reasoned that the phenotype of a gp/Gp, ChlGwt/ChlGW121* double heterozygote would be informative; if Gp did not correspond to ChlG, then the plant should be viable and green-podded, refuting our hypothesis. Conversely, if Gp did correspond to a functional ChlG, then it should be yellow-podded. Of the 16 F1 progeny that we derived from the cross between gp/gp and the TILLING mutant heterozygotes, 10 had yellow pods, and all yellow-podded F1 plants carried the ChlGW121* null allele (Fig. 3f,g). This result supported our hypothesis and confirmed that the gp mutant does not provide a fully functional ChlG.

The evidence presented above demonstrates that a ChlG deficiency mediates the mutant phenotype and establishes that ChlG is allelic to Gp. The wild-type alleles of other genes in the vicinity cannot rescue the gp mutant phenotype. While the detailed molecular and regulatory mechanism underlying this defect in chlorophyll synthesis remains to be established, our current understanding predicts that ablation of the NLR gene in a gp mutant, thereby removing the fused NLR–ChlG transcripts, would restore the wild-type green pod colour.

Pod shape

In 1537, Ruel described the difference in the shape of the ripe pod as “Valvulae etia recetes eorum quae nullo pedameto fulciuntur, ante que durescat, edendo sunt27; roughly translated, this means “Those where the valves provide little support are to be eaten before they harden”, indicating that as today, these were vegetable peas. The lack of a sclerenchyma layer (pod parchment) is conditioned by the recessive allele at either (or both) of the genes: P and V16. It is uncertain which of these genes Mendel was discussing; he could have worked with either, or perhaps both. Our GWAS analysis identified regions that are statistically correlated with this phenotype, two of which correspond significantly to the expected positions of P and V (Fig. 2b), suggesting that both p and v alleles are relatively common. Additional signals detected may correspond to genes affecting pod wall thickness (N) or structure (Sin)28 (Extended Data Fig. 2a).

Within our 8.3 Mb GWAS peak on chromosome 1, the gene Psat01G0420500 had the greatest significance, which is consistent with a 0.92 Mb interval defined in the JI2822 × JI0816 F2 mapping population (Extended Data Fig. 2b–d). Psat01G0420500 encodes a dodeca-CLE peptide that is identical to the tracheary element differentiation inhibitory factor (TDIF) of Arabidopsis CLE4121,29 (designated here as PsCLE41; Supplementary Fig. 20). One allele within this single-exon gene, which carries an in-frame premature stop codon (R79*) upstream of the TDIF motif (Extended Data Fig. 2e–g), fully explains the p phenotype. CLE41 or CLE44 peptides are known to repress the formation of xylem30 in Arabidopsis thaliana and specify positional information that determines the rate and orientation of cell divisions in vascular tissue in conjunction with the receptor kinase PXY31. TDIF is proposed to function as a non-cell autonomous signalling peptide controlling cell fate32 and lignification33. This suggests a model for P whereby this TDIF peptide interacts with a PXY-like protein to specify pea pod sclerenchyma development via the established TDIFPXYWOX signalling pathway34,35.

The genomic interval corresponding to V, as identified by GWAS, spans a broad region (chromosome 6, 610–650 Mb) within which the most significant 2-Mb segment (chromosome 6, 629–631 Mb) overlaps with a 1.25-Mb interval (chromosome 6, 629.23–630.48 Mb) defined by an F2 mapping population36, narrowing the V candidate region (Extended Data Fig. 3a–e). A detailed haplotype–phenotype association study within this interval excluded several previously proposed gene candidates (Supplementary Tables 28 and 29 and Supplementary Fig. 21). We found that accessions with parchmentless pods, with or without the R79* mutation (ppvv or PPvv), are clustered into a single haplotype (Extended Data Fig. 3g) for Psat05G0804500, a homologue of the Arabidopsis MYB26 gene (referred to here as PsMYB26, which has also been designated as PsMYB73 (ref. 37; Supplementary Fig. 22). AtMYB26, which encodes a master transcription factor that directly activates its downstream NAC-domain transcription factors in A. thaliana38, has been reported to promote secondary wall thickening and lignification of the endothelium, and MYB26 is required for sclerenchyma formation in legume pods39. PsMYB26 exhibits tissue-specific expression in the pod endocarp and is the most highly differentially expressed gene between wild-type (P/P V/V) pods and mutant (P/P v/v) pods (Extended Data Fig. 3f,h and Supplementary Tables 30 and 31). RNA-seq and qPCR with reverse transcription (RT–qPCR) analyses of both P/P v/v and p/p v/v mutant lines both show reduced expression of PsMYB26 and the PsNAC (Supplementary Information) compared with wild-type P/P V/V lines (Extended Data Fig. 4a–f and Supplementary Fig. 23). Furthermore, a virus-induced gene-silencing (VIGS) experiment followed by RT–qPCR analyses demonstrated the downregulation of PsMYB26 and PsNAC genes in the silenced lines (Extended Data Fig. 4g). PsMYB26 is downregulated in p/p V/V mutant lines, suggesting that PsMYB26 and PsNAC have a role in controlling the parchmentless phenotype and supporting an epistatic relationship between PsCLE41 and PsMYB26 (Extended Data Fig. 4f–h).

A 23 kb Ogre-type LTR retrotransposon insertion, located upstream of PsMYB26 and with the same polarity, is present in all P/P v/v genotypes in our diversity panel (Supplementary Fig. 24 and Supplementary Tables 32 and 33). Although PsMYB26 and the associated Ogre insertion are strong candidates for V, further work is needed to fully elucidate the detailed molecular mechanisms underlying the v allele.

Fasciation

Mendel used the name Pisum umbellatum when describing “the position of the flowers” on the stem of pea, a term previously used by Gerard40 to describe the fasciated form with an umbellate inflorescence. Fasciation in pea can vary in its severity, from stem bifurcation to an extreme clustering of flowers at the apex. There are several pea genes that, when mutant, confer a fasciated phenotype; of these, Fa (chromosome 4, linkage group IV) is considered to be the gene Mendel studied41,42.

GWAS analysis identified a broad signal (chromosome 4, 0–40 Mb) (Fig. 2b) that underwent further refinement through investigation of F2 populations. Bulked segregant analysis (BSA) narrowed this region down to a 15-Mb interval (Supplementary Fig. 25) and fine mapping led to the delineation of a 1.33 Mb candidate interval (chromosome 4, 18.18–19.51 Mb, ZW6) (Extended Data Fig 5a–e and Supplementary Tables 1820, 34 and 35). Within this 1.33-Mb interval, we found that all fasciated accessions were clustered together within haplotype 5 (Extended Data Fig. 5f); however, accession JI1713, also in haplotype 5, usually is not fasciated (see below). Analysis of each gene within this interval showed that only one gene, Psat04G0031700, co-segregated with fasciation. All fasciated accessions are clustered into Hap 3 of this gene, which is characterized by a 5-bp deletion in exon 2, creating a frameshift and premature stop codon, rendering the protein non-functional and explaining fasciation in fa lines (Extended Data Fig. 5g,h and Supplementary Table 36). This gene encodes a cell membrane-localized senescence-associated receptor-like kinase, a class of CLAVATA3 INSENSITIVE RECEPTOR KINASES (CIK) signalling (co-)receptor kinases, known for their role in maintaining the structure of the shoot apical meristem43 (designated as PsCIK2/3; Supplementary Fig. 26). Field phenotyping and microscopic observations of fasciated versus wild-type plants showed that the bunched apical flowers of the mutant are borne on a wider stem with additional vascular strands, derived from a broadened apical meristem (Extended Data Fig. 6a–e). Comparative transcriptome analysis in Caméor revealed high expression of PsCIK2/3, CLV3 and PsCLV2 in the stem (Extended Data Fig. 6f and Supplementary Tables 30 and 31). PsCLV1, also expressed in stems, showed its highest expression in root, whereas PsWUS is uniquely expressed in apical bud. RT–qPCR analysis showed that PsCIK2/3 expression was significantly downregulated in both the apical bud and stem of the mutant line (JI0814, fa/fa) compared with a wild-type line (Caméor), whereas PsCLV3 was significantly ‘upregulated’ in the apical bud of the mutant line (Extended Data Fig. 6f–h). The ‘upregulation’ likely reflects the enlarged size of the fasciated apical meristem, although further investigation in pea is needed to clarify this point. Interactions between PsCIK2/3 and both PsCLV1 and PsCLV2 (Extended Data Fig. 6i,j) were revealed by yeast two-hybrid assays. We hypothesize that PsCIK2/3 functions within the conserved meristem homeostasis CLV3-WUS genetic regulatory pathway (Supplementary Fig. 27), regulating shoot apex development and meristem structure maintenance44.

A second minor GWAS signal was detected on chromosome 6, a region that was not previously associated with fasciation (Fig. 2). This genomic region was also identified in our bi-parental mapping populations (Extended Data Fig. 7a). In the JI2822 (Fa) × JI0816 (fa) F2 population, we observed that out of 397 individuals scored, 32 exhibited a wild-type phenotype but carried the recessive allele at fa (Supplementary Tables 1820), which is consistent with the GWAS and BSA studies. This suggests a model whereby the recessive allele of a gene in this region of chr6LGII masks the fasciated phenotype. Accordingly, we designated this enigmatic second locus as modifier of fa (mfa). In this model, individuals that are recessive for both loci—the fa/fa mfa/mfa genotype—have a wild-type appearance (Extended Data Fig. 7b,c). This may also explain why some accessions, such as JI1713, that carry the 5 bp deletion in Psat04G0031700 (PsCIK2/3) usually are not fasciated, incidentally accounting for the secondary GWAS peak. Previous studies have highlighted complexity in the segregation of fasciation, with reports of both reversals of dominance and 2-factor segregation ratios (15:1), rather than the expected 1-factor segregation ratio (3:1), in F2 populations for some crosses45. These unusual features may, in part, be explained by the previously unrecognized gene Mfa. The nature of Mfa remains to be determined, but it resides within the interval ZW6 chr. 6:244,689,457-253,701,016 identified in this study.

In the F3 generation, the Fa–Mfa genetic model successfully explains the observed phenotypes on the basis of the genetic marker data (Extended Data Fig. 7). The fa/fa mfa/mfa genotypes can appear wild type, on the basis of the position of lateral inflorescence, but may also exhibit some features of fasciation, demonstrating an incomplete penetrance or variable expressivity of mfa, which requires further investigation. We propose that mfa/mfa homozygotes either delay or prevent the formation of structures that would cause a fa/fa plant to be scored as a fasciated phenotype (Supplementary Notes).

Complex and quantitative traits

It has been argued that Mendel’s motivation in studying inheritance was related to an applied plant breeding programme46. To connect further the traits and genes discovered in pea with their agronomic interests and breeding values, we analysed 72 additional agriculturally relevant traits, including seed, pod, flower, leaf, root and plant architecture characteristics measured within our Pisum diversity panel (Extended Data Fig. 8a–c and Supplementary Table 37). A comprehensive GWAS established hundreds of significant marker–trait associations (Supplementary Table 38) including 14 previously cloned genes (Fig. 4a and Supplementary Tables 39 and 40) and determined the physical locations of 22 loci that were previously known from classical genetics to within an average genomic interval of 12 Mb. In addition to the four newly characterized genes associated with three of Mendel’s pea traits, our study uncovered dozens of potentially important new loci, many with breeding implications. For example, the SDN locus (seed number per pod) and SDY locus (seed total weight per plant) related to yield components (Extended Data Fig. 8g,i and Supplementary Table 39). The organ size locus (Os1), which controls pod width and seed weight, is validated below. These results demonstrate the high quality of our dataset, and the robustness of the association genomics analyses, laying a solid foundation for future functional elucidation of pea traits and their application in breeding programs.

Fig. 4: A Genome–phenome association map for the identification of genetic loci that confer agronomic traits.
figure 4

a, Summary of the most significant trait–marker associations underlying a variety of agronomic traits presented as a combined Manhattan plot. Gene symbols shown in a circle correspond to Mendel’s loci. b, Manhattan plot of GWAS data for seed protein content, showing a peak overlapped with the R gene locus. c, Manhattan plot of GWAS data for the presence or absence of axial ring pigmentation, on a subset of phenotypic data excluding accessions carrying white flowers (a/a). These data were collected at Harbin (northern China, 2022). A peak at the expected genomic position of D is significantly associated with the accumulation of axillary anthocyanin, and the peak at chromosome 6 is the location of A. d, Genomic interval of D locus on chromosome 2 defined by recombinant inbred lines (RIL) mapping and GWAS analyses, further defined by bioinformatic analysis of FN mutants as a MYB gene cluster7,47,48, with the genes PsMYB104 and PsMYB106 both deleted in the d mutant line FN1218/6. The region outlined in red line indicates the approximate position of the deletion detected in FN1218/6 from mapping of sequence reads. e, Manhattan plot of GWAS data for Af/af (semi-leafless phenotype). Scale bar, 5 cm. f, Manhattan plot of GWAS data for hundred seed weight (HSW) and pod width (PW). The HSW and pod width genomic intervals span the same 8 Mb genomic region, named Organ Size 1 (PsOs1). Scale bar, 2 cm. g, Narrowed genomic interval of PsOs1 on chromosome 2 defined by two F2 mapping populations and BSA analysis (Methods) as a 1.01-Mb region encompassing 11 protein-coding genes, of which Psat02G0011300 (marked in yellow) is the most highly expressed gene. Photographs in c,e,f show the corresponding contrasting phenotypes.

Source data

Axil ring pigmentation

In his 1866 paper, Mendel noted the pleiotropic effects of the seed coat and flower colour trait (A versus a) and specifically referred to the presence or absence of axil ring pigmentation as one of these effects. The A gene regulates the presence or absence of anthocyanin pigmentation throughout the plant and a is epistatic to D, which regulates the pattern of axil ring pigmentation45. Axil pigmentation patterns in pea (Supplementary Fig. 29) are reminiscent of leaf marking in Trifolium47 and Medicago48, which are controlled by similar MYB transcription factors.

GWAS analysis revealed two strong signals associated with axil ring pigmentation (in coloured flower lines) (Fig. 4c). One of these corresponds to A (chromosome 6), whereas the other is at the expected position of D (chromosome 2), where there is a cluster of MYB genes49 (Supplementary Fig. 30 and Supplementary Table 41). The potential role of one of these MYB genes was investigated further by VIGS, which showed that the MYB-encoding gene Psat02G0138300 (also known as PsMYB16) affects the accumulation of the axil ring anthocyanin pigmentation (Supplementary Fig. 31). Furthermore, deletion of another two MYB genes at the same locus, PsMYB104 and PsMYB106 (ref. 37, in the induced Fast Neutron mutant line FN1218/6 resulted in the complete absence of axil ring pigmentation (Supplementary Figs. 3234). We show that the FN1218/6 deletion is allelic to the d allele in JI0073 and JI2202 (P. abyssinicum, a taxon that lacks axil ring pigmentation) (Fig. 4d), implicating these genes as corresponding to D.

The results presented here reveal the genetic complexity of axil ring pigmentation regulated by D. There are multiple alleles of D within the MYB gene cluster, and many spontaneous conversions from one allelic form to another have been reported45, suggesting that it is the combination of alleles at several of these MYB genes that determines the presence, absence or pattern of this pigmentation. Both a and a2 are epistatic to D, and we can postulate that the MYB genes involved in the D and d phenotypes are part of a MYB (D)–bHLH (A)–WD40 (A2) complex7,50.

Organ size

Mendel examined the segregation of traits that have clear alternative states but noted that seed size (among other traits) differed between his parental lines, although he considered that this quantitative difference was not suitable for his analyses. Seed size in pea defines some market classes, such as the ‘marrowfat’ types, which have large, irregular-shaped seeds and a high protein content. Seed size has been the subject of quantitative trait locus (QTL) analyses51, and we have investigated this further within our diversity panel.

We found a significant locus on chromosome 2 that influences both pod width and hundred seed weight (HSW) (Fig. 4f and Supplementary Fig. 35), which is in a similar location to a previously described seed size QTL in Medicago and pea52. We designated this locus as PsOs1. Combining fine mapping and differential gene expression analysis, we identified Psat02G0011300, which encodes a SIAMESE-related protein (SIM or SMR), a cyclin-dependent protein kinase inhibitor (CKI), that influences cell division and enlargement during the cell cycle and consequently alters plant cell size53, as a gene candidate for PsOs1 (Fig. 4g, Supplementary Figs. 3638 and Supplementary Tables 42 and 43). VIGS-based functional validation, coupled with a transgenic overexpression line in Arabidopsis (Supplementary Figs. 3942) demonstrate the key role of PsOs1 in regulating seed weight and pod width.

Architectural innovation in pea breeding

The adoption of afila types represents the most important innovation in modern pea breeding54. These types include the ‘leafless’ (af/af st/st) and ‘semi-leafless’ (af/af St/St) varieties54. Figure 4b shows a strong GWAS signal for this character at the expected position of Af (afila) at the end of chromosome 2. In addition to the five haplotypes (Hap1–5) corresponding to the wild-type phenotypes (Af), our analysis identified three deletion haplotypes (Hap6, Hap7 and Hap8), which correspond to hap_4, hap_2 and hap_3, respectively54 (Supplementary Figs. 43 and 44), all of which remove the genes PALM1a and PALM1b, associated with the recessive allele (af). These deletion haplotypes are the most abundant and appear in similar relative proportions here and in the previous analysis54.

The presence of multiple af haplotypes confirmed independent origins of the afila alleles. Here, we refined the deletion end points at a nucleotide level resolution for the 30 afila lines showing that these haplotypes are not further differentiated by their deletion end points; the gene content of the deleted segments is thus more precisely defined (Supplementary Tables 44 and 45). This is important because afila alleles delete different combinations of flanking genes, notably PsNaOD1, PsNaOD2 and PsNaOD3, which affect seed yield and seed weight55.

Discussion

Despite the clarity of his 1866 paper, there is some dispute about what Mendel did in his detailed research. It has been argued that Mendel was not primarily interested in inheritance56,57, or that he had a pre-formed theory of inheritance that he sought to demonstrate, even to the extent of fabricating data to conform with his theory58. These views are mutually exclusive, and we reject them both46,59.

We have shown a remarkable diversity of mutational mechanisms in the genes associated with the seven pairs of contrasting traits that Mendel studied. There are several point mutations in a, one affecting the pattern of splicing and two different single nucleotide insertions affecting the reading frame, whereas le corresponds to an amino acid substitution caused by a missense mutation. There is a single nucleotide substitution that generates a premature stop codon in the CLE41 gene at the P locus, and insertion events of class I (i and v) and class II (r) transposons have been observed2,5. We have also uncovered additional novel types of variation, corresponding to DNA deletions that lead to loss of function, such as the remarkable case of gp, with a large DNA deletion upstream of ChlG, a promoter deletion in the i-2 allele, the fa allele, with a small deletion within an exon, and new alleles of a, with one or more deleted exons. An unexpected finding in this study was the existence of an intragenic suppressor allele of A that implies that the a allele was in existence long enough for this unlikely second site mutation to have occurred. This intragenic suppressor mutation corresponds to a shift in the position of an intron, which is rarely identified, even in inter-specific comparisons of many genes60.

The biological processes that these genes represent range from variation in the activity of enzymes in primary metabolism (r, i and gp), hormone interconversion (le), transcription factors for regulation of secondary metabolism (a) and cell wall thickening (v), the regulation of cell fate by a small signalling peptide (p), and a cell membrane (co-)receptor kinase (fa). The two green-versus-yellow phenotypic differences (cotyledon or pod colour) correspond to disruption in either the final step of chlorophyll synthesis (gp) or the first step of chlorophyll degradation (i). This difference between synthesis versus degradation explains which phenotype, green or yellow, corresponds to the dominant versus recessive allele. Although the elucidation of the biochemical and regulatory mechanisms underlying these genes is beyond the scope of this study, the genomic and genetic discoveries presented here are essential for advancing our understanding of Mendel’s pea traits. For example, on the basis of the discovery of the fused aberrant transcripts arising from the NLR–CHLG genomic region, we propose that transcript stability is altered by transcriptional interference during chlorophyll synthesis or through a nonsense-mediated decay pathway, leading to an increased degradation rate of ChlG transcripts.

A longstanding question in relation to Mendel’s pea work was whether the phenotypic variation he described corresponded to rare variants of genes that explain only a minor proportion of the genetic variation for that trait. Our GWAS analyses emphatically show that this is not the case. Indeed, in one case in which genetic heterogeneity was expected (fasciation), the variation that we detected corresponded to a single genetic locus (Fa), albeit with a previously unsuspected modifier locus (Mfa). There are three caveats to this claim. The first is that the parchmentless pod trait is (as has long been known) determined by either p or v, or the combination of these two distinct and independent genetic loci. A second caveat is that for the green versus yellow cotyledon phenotype, there are clearly multiple GWAS peaks, albeit with lower significance than that of I. This probably reflects the influence of the seed maturation process on the penetrance of this phenotype, as was noted by Mendel in his 1866 paper. Finally, we observed an unusual feature of the GWAS peak corresponding to Gp, where there is a broad shoulder corresponding to most of the short arm of this chromosome. The reason for this is unknown.

This raises two general questions regarding GWAS analyses in defining genetic variation underlying traits: whether broad GWAS peaks provide sufficient resolution to identify a manageable number of candidate genes; and how the positions of significant GWAS signals correspond to previously described genetic variants. We have seen that for the seven Mendelian traits (and D), the GWAS peaks are significant, and all correspond well to the expected genetic loci. Furthermore, in our broad survey of many other agronomic traits for genotype–phenotype associations, nearly all the GWAS peaks correspond to the location of previously described genetic loci. This demonstrates that pea is an excellent model system for association genomics studies and GWAS is a suitable first step for trait–gene discovery and functional elucidation. The reliability of GWAS in pea is partly due to the fact that an unusually high proportion of pea genes are single copy17, even though the pea genome is large and gene density is low throughout the chromosomes, maintaining a strong extended linkage disequilibrium. Presumably this is in part because of the strict inbreeding habit of pea.

We have shown how complementary approaches can narrow down these intervals to identify candidate genes. For the genes characterized in this study, GWAS intervals alone were insufficient to delineate small sets of candidate genes. Additional resources and experimental approaches are necessary, such as induced mutants and specific bi-parental mapping populations, comparative transcriptomes, and genetic functional validation. Future work will require innovative approaches and new technologies, such as long-read DNA and RNA sequencing, which could address the limitations of short-read sequencing technologies and are crucial for investigating how large-scale structural variations and transposons contribute to phenotypic changes. Furthermore, a mature genetic transformation system and targeted gene editing in pea are urgently needed. These tools would enable a detailed examination of the biochemical genetics underlying complex mutations, such as the aberrant transcripts produced at the gp locus, the MYB gene clusters at the D locus, and the LTR retrotransposon insertion upstream of the gene encoding the transcription factor MYB26 (V locus).

We leveraged the rich reservoir of genetic diversity from a global Pisum diversity panel and established a high-quality genomic and phenotypic variation map. The large numbers of genotype–phenotype associations that we have found represent the beginning of a new phase of systematic trait dissection at the molecular level in pea. This genotype-to-phenotype strategy can be expanded, with the identified genes and alleles projected into other crops, particularly legumes. This study is essential for pea basic research, education in biology and genetics, and breeding practices.

Methods

Plant materials and methods

Germplasm panel

A total of 697 accessions, maximizing genetic diversity, were selected from the JI Pisum Germplasm Collection for this study (Supplementary Table 1 and https://www.seedstor.ac.uk/). These germplasm accessions were introduced in 2019 to the Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, China, where they are grown and investigated annually.

DNA extraction for whole-genome resequencing

Genomic DNA was extracted from approximately 50 mg leaf tissue of 3-week-old seedlings. Extraction used the oKtopure system (LGC Biosearch Technology) following tissue desiccation with silica for 48 h. A bespoke protocol was used with the following volumes per sample: 250 µl lysis buffer, 170 µl binding buffer, 20 µl sbeadexTM suspension, 300 µl PN1 wash buffer, 300 µl PN2 wash buffer, 300 µl PN2 wash buffer (×3 wash cycles) and using 75 µl final elution buffer. For each accession, a minimum of 6 μg of genomic DNA was used to construct a 150-bp paired-end sequencing library with an insert size of 500 bp, following the manufacturer’s protocols (employing PCR-free methods), which was subsequently sequenced on the DNBSEQ Platform at BGI-Shenzhen resulting in ~80 Gb clean reads with a coverage of ~20× for each accession.

Phenotyping

DNA was extracted from a single plant whose seed was bulked up for progeny phenotyping. The diversity panel was planted in three different sites, Norwich, UK (52.62° N, 1.28° E), Shenzhen (Southern China, 22.61° N, 114.51° E) and Harbin (northern China, 45.86° N, 126.83° E). In China, four rounds of phenotyping were conducted. Specific subsets of accessions and some F2 populations were grown indoors in the greenhouse of Shenzhen Agricultural Field Farm, with 16 h of light/8 h of darkness. Phenotypes collected at the three stations (2020–2023) and a historical JIC phenotype dataset were curated in Seedstor (https://www.seedstor.ac.uk/). In Shenzhen, peas were planted in winter (October) and harvested in March the following year, whereas in Norwich and Harbin they were planted in spring (March to April) and harvested in August to October of the same year.

For the phenotyping of pod colour (green versus yellow-podded lines), a field trial of three 1 m2 microplots of 100 seeds each was sown in spring 2023, in which a 1:1 ratio of BC6 S3 Gp/Gp and gp/gp seeds (selfed seed of S2 homozygotes) were mixed and sown at random in each plot. At the pod filling stage, the Gp plants were tagged and at plot harvest seed was collected from individual plants to determine the yield of Gp and gp homozygotes. Seeds were weighed and counted on a Data Count R25+ machine (https://data-technologies.com/). Pod length and width were measured on 25 randomly selected pods. For the phenotyping of organ size, pod width (PW) and HSW were measured in mature pods of the F2 and F2:3 populations after harvest. In the F2 populations, PW was assessed using 15 representative pods, divided into 3 groups of 5, with the total width of each group measured sequentially. For the F2:3 populations, PW was determined using 5 representative pods, with their total width measured in a similar manner. HSW was calculated by randomly weighing 100 seeds from each accession and repeating the process 3 times to obtain an average weight for each accession. Other more specific phenotypes were collected as described in Supplementary Table 37 and in line with published descriptors (https://www.seedstor.ac.uk/search-phenotypes.php).

Construction of the pea genomic variation map

Read mapping, SNP calling and SNP annotation

The trimmed clean reads of each accession were aligned against the reference genome of pea (P. sativum) cultivar, ZW6 (ref. 18) and Caméor v.1.0 (ref. 17), using BWA-MEM (v.0.7.17) with default parameters63. Unmapped, non-unique and duplicated reads were filtered out using SAMtools64,65 (v.1.9) and Picard (v.2.20.3-SNAPSHOT) before variants were called by a standard pipeline of Genome Analysis Toolkit (GATK65 v.4.1.2) and Sentieon66 (v.202112.01). SNPs were further filtered to remove low-quality variants defined as (1) SNPs with more than two alleles; (2) SNPs with QD < 2.0, FS > 60.0, MQ < 40.0, SOR > 3.0, MQRankSum < −12.5, ReadPosRankSum < −8.0; (3) SNPs with observed heterozygosity (Hobs) exceeding the maximum calculated value (Hobs_max) based on the inbreeding coefficient (F), where F was calculated as 1 − (Hobs/Hexp), with Hexp defined as 2p(1 − p) using the frequency of the non-reference allele, and Hobs_max was determined as 10 × (1 − Fmedian) × Hexp for variants with F > 0 and minor allele frequency (MAF) > 0.05; (4) SNPs with missing rate >20% and MAF < 0.01. SnpEff67 (v.4.3t) was used to annotate the SNPs, and functional significance was then categorized on the basis of their positions with respect to genes (intergenic regions, exons, introns, splicing sites, untranslated regions, upstream and downstream regions) and mutation consequences (missense, start codon gain or loss, stop codon gain or loss and splicing mutations).

Identification of indels, gene PAV and gene CNVs, and SV

Small insertion–deletion mutants (indels; ≤50 bp) were called using GATK (v.4.1.2) and filtered following the criteria: QD < 2.0 || low_QD || FS > 200.0 || high_FS || ReadPosRankSum < −20.0 || low_ReadPosRankSum before they were annotated using SnpEff (v.4.3t). Read depth variation from read mapping analysis was used to identify gene presence and absence variation (PAV) and gene copy number variation (CNV) through normalization and correction in statistical analyses, following five steps: (1), mapped read depth at each gene was counted for each accession; (2), a correction for read depth variation (RDV) was applied, accounting for highly similar genes through all-versus-all coding sequence (CDS) alignment using BLASTN. Recently duplicated genes were collapsed into representative genes to minimize depth bias, which were further normalized by dividing the corrected read depth of the gene by the average sequencing depth of the accession; (3) the distribution of read depth versus GC content was used to correct read depth bias for each gene resulting from differential GC contents; (4), read depth variation was corrected for genomic regions with insertions or deletions in the genome reference; (5), subspecies-unique and shared CNVs were characterized by calculating the number of accessions with different copy numbers for each gene within each subspecies.

Different categories of structural variants (SVs: duplication, inversion, translocation and large-scale deletion or insertion) were detected on the basis of read mapping (read depth and read pair relationships) on PCR-duplicate-marked bam files using Delly (v 0.8.7) with default parameters; a summary of SVs identified is given in Supplementary Table 11.

Linkage disequilibrium analysis and pea haplotype map

A two-step LD pruning process was implemented to generate a high-quality core SNP dataset for the construction of a haplotype map68. Initially, SNPs were pruned on the basis of linkage disequilibrium (LD) using PLINK69, with a window size of 10 kb, a window step of one SNP, and an r2 threshold of 0.8. A second round of LD pruning was conducted with a window size of 50 kb, a window step of one SNP, and the same r2 threshold of 0.8. For population LD-based haplotype analysis, the filtered SNPs were phased using Beagle (v.21Apr21.304)70. Subsequently, haplotype blocks were delineated utilizing PLINK with specific parameters (--blocks no-pheno-req --blocks-max-kb 1000 --geno 0.1 --blocks-min-maf 0.05). To merge adjacent blocks maintaining significant LD, D’ statistic values were calculated between all SNPs of consecutive blocks. If the lower quartile (Q1) exceeded 0.98, the adjacent blocks were merged. After filtering for the inbreeding coefficient, HAPPE71 was employed to identify haplotype clusters (haplogroups) for each block.

Construction of mapping and validation populations

JI2822 × JI0816 F2 population

Lines JI0816 and JI2822 (Supplementary Table 18), both of short stature, are maintained in the John Innes Pisum germplasm collection (https://www.seedstor.ac.uk/). JI0816, also known as WBH 1185, has pink flowers, a fasciated stem and yellow pods lacking pod parchment, corresponding to the mutant alleles b, fa, gp and p, respectively. JI2822, a recombinant inbred line derived from the cross JI0015 × JI0399, is wild type at these four loci. JI0015 and JI0816 share the gp allele, indicating that these two lines had a common parent; therefore segments of the genetic map are devoid of segregating alleles. 1,000 F2 seeds from 9 F1 plants (JI2822 × JI0816) were sown at the JIC field station in spring 2022. DNA preps from 942 plants were prepared from individual leaflets using the Qiagen DNeasy protocol (https://www.qiagen.com). Of these, 405 were genotyped using an Axiom SNP array as described49. The phenotypic and genotypic data are available in Supplementary Tables 1820, and the sequences corresponding to the Axiom markers are available in supplementary table 3 of Ellis et al.49.

JI0015 × JI0399 and JI2822 × JI2833

Three populations have been used for mapping Gp. The first to be used was the previously described recombinant inbred population JI0015 × JI0399 (Supplementary Table 21), later genotyped by Neogen, using an Infinium array as described previously51. The second was an F2 population derived from a cross between two of these RILs (JI2822 Gp/Gp and JI2833 gp/gp), which was screened using PCR for markers already mapped in JI0015 × JI0399 in order to identify informative individuals (Supplementary Table 22). These, together with selected RILs with informative recombination events were genotyped with Axiom markers as described elsewhere49,65. Gp also segregates in the JI2822 × JI0816 F2 population as described above. The marker data are available in the supplementary file: Gp mapping in JI0015 × JI0399 (Supplementary Tables 21 and 22).

Other F2 mapping populations and BSA

We selected parental lines with contrasting pairs of traits to map genetic loci of interest in F2 populations using mapping by sequencing72 of bulked segregants. For genetic loci controlling uncharacterized Mendel traits: flower position (axial versus terminal), pod colour (yellow versus green), and pod shape (inflated versus constricted), crosses were made between Caméor (axial) × JI0814 (fasciated) and JI1995 (green pod) × JI2366 (yellow pod). F2 populations for the P/V loci (pod shape) were derived from the cross between JI0074 (P/P v/v) as the male parent and JI1995 (P/P V/V) as the female parent, and the cross between JI2822 × JI0816. F2 populations for the D locus (one (Dco) or two (Dw) axial rings of anthocyanin pigmentation) were derived from three crosses, with JI0191 (Dw), JI0794 (Dw) and JI1669 (Dw) as male parents and JI0328 (Dco) as the female parent. F2 populations for the Fn/Fna loci (flower number per node, fpn) were derived from four crosses, with JI0441 (1fpn), JI2410 (3fpn), JI0745 (2fpn) and JI0746 (3fpn) as male parents and JI1995 (2fpn) as the female parent. The marker and BSA analysis of the F2 population is as described36.

Approximately 300 plants from the F2 population of each of these crosses were planted in Shenzhen, China. Wild-type and mutant bulked DNA samples were prepared by mixing equal amounts of DNA from 30 accessions with the dominant and recessive phenotypes, respectively. DNA was isolated from fresh leaves using the CTAB method73. 50× depth genome sequences for each of the parents and the bulked samples were generated. Short reads were aligned against the ZW6 reference genome using BWA-MEM (v.0.7.17) and SNPs were identified using Samtools (v.1.9). The variation dataset was analysed using the G’s value method of the QTLseqr package (v.0.7.5.2).

Genetic mapping of Gp

Green versus yellow pod colour segregates in the recombinant inbred (RIL) population derived from the cross between JI0015 (gp/gp) and JI0399 (Gp/Gp). The JI0015 × JI0399 RIL population comprises 90 recombinant inbred lines, which, together with their parents, were genotyped using an Infinium array (Neogen) that detected 13,204 biallelic SNPs. This enabled us to position 5,209 PsCam markers on a genetic map (JI0015 × JI0399) and place Gp between the markers PsCam005046 and PsCam056084 (and their co-segregating markers). Additional mapping was undertaken, using an Axiom SNP array with 84,691 features49 of selected F2 progeny of a cross between JI2822 (Gp) and JI2833 (gp) together with RILs from JI0015 × JI0399 known to have recombination events at informative locations. JI2822 and JI2833 are both RILs from the JI0015 × JI0399 population. With respect to the ZW6 assembly18, this placed Gp between the Axiom markers AX-183865165 (chr. 2:320968993) and AX-183571028 (chr. 3:325580858) (JI0015 × JI0399). Analysis of an F2 population derived from crosses between JI2822 (Gp) and JI0816 (gp) placed Gp between the Axiom markers AX-183571050 (chr. 3:321020350) and AX-183879077, (chr. 3:324762848; Supplementary Table 18).

We performed different association genomics analysis for pod colours, including the SNP-based GWAS, LD-based haplotype GWAS, kmer-derived IBS-based haplotype GWAS, and the SV-based GWAS (Supplementary Fig. 14), all resulting in consistent and significant single GWAS peaks for pod colour located in the expected position of Gp, as seen in Manhattan plots (Supplementary Fig. 14).

Allelism tests for gp

Crosses were made between pairs of yellow-podded lines in the JIC germplasm collection (Supplementary Table 18). Seed and vegetative phenotypes were used to identify F1 progeny plants, and those accessions allelic, or non-allelic, to gp were identified by their yellow or green pod colour, respectively.

Near-isogenic lines for Gp versus gp

The JI0015 gp allele was introgressed into the Caméor background by sequential back-crossing and F1 progeny testing using a codominant PCR marker assay with one forward (25994_F) and two reverse (25994_15R and 25994_399R) primers (Supplementary Table 18). Gp (596 bp) and gp (688 bp) alleles were distinguished in a 35 cycle, 10s–30s–60s Touchdown PCR reaction that reduces the initial 62 °C annealing temperature to 50 °C in the first 10 cycles.

Marker development and QTL mapping for PsOs1

The organ size-related quantitative trait locus (PsOs1) was fine-mapped using 21 Kompetitive Allele Specific PCR (KASP) markers for SNPs distinguishing accessions JI0074 and JI1995 after whole-genome resequencing in the candidate region. Each KASP marker was designed with two allele-specific forward primers (Supplementary Table 47) and one common reverse primer, on the basis of 200-bp sequences upstream and downstream of target genic SNPs, following the standards of LGC Biosearch Technologies. The genetic linkage map was constructed using JoinMap v.4.0 software. Windows QTL Cartographer v.2.5 software facilitated inclusive composite interval mapping (ICIM) for identifying and analysing QTLs. A logarithm of odds (LOD) score of ≥3.0 was deemed indicative of a QTL.

Genome-wide association study

The multi-location and multi-season phenotypic dataset was used to perform GWASs with the SNP matrix using GEMMA (v.0.98.1)74, and employing the following parameters (gemma-0.98.1-linux-static -miss 0.9 –gk -o kinship.txt and gemma-0.98.1-linux-static -miss 0.9 -lmm -k kinship.txt). The structural variation matrix was used to test for association with phenotypic variation for each of the selected traits using the same parameters as above. The haplotype map was used to test for association with phenotypic variation for each of the selected traits using RTM-GWAS75 with the following parameters (rtm-gwas-gsc –vcf in.vcf –out out.matrix and rtm-gwas-assoc –vcf in.vcf --covar out.matrix.evec --no-gxe). We used GEMMA’s Wald tests and directly visualized the resulting P values as –log10-transformed values in Manhattan plots. In-house R scripts were employed for data plotting. To identify SNPs of interest, we applied a Bonferroni correction to the significance threshold (α = 0.1) based on the total number of tested variants (9,214,461 SNPs). This yielded a threshold of –log10(0.1/9,214,461), and any SNP surpassing this cutoff was considered noteworthy and highlighted in the Manhattan plots.

Orthologues and gene family analysis

Phylogenetic analyses were conducted on key gene families in this study, such as MYB, CLE and CIK/SERK, following a consistent workflow. Relevant A. thaliana orthologous genes containing the required domains were retrieved from TAIR (https://www.arabidopsis.org), and profile hidden Markov models (HMMs) were constructed using HMMER (v.3.1b1) on the basis of multiple sequence alignments generated by MAFFT (v.7.475). These HMMs were then employed to identify putative homologues in the pea (P. sativum) ZW6 genome. Multiple sequence alignments for each family were trimmed with trimAl (v.1.5.rev.0), and the best-fit amino acid substitution models were selected using ModelTest-NG (v.0.1.7). Phylogenetic trees were constructed by IQ-TREE (v.2.1.2) with 1,000 ultrafast bootstrap replicates. For synteny analysis for each gene family, we used OrthoFinder (v.2.5.4) to identify orthologous clusters among pea and related legumes (for example, Vicia sativa, Medicago truncatula, Cicer arietinum, Lotus japonicus, Vigna radiata, Phaseolus vulgaris and Glycine max), and visualized collinearity blocks with JCVI (v.1.2.7).

Gene functional validation experiments

Fast neutron mutants

Several Fast Neutron mutants from a population described by Domoney et al.76, were included in this project. These were: FN1453/1sil-like; FN1091/4 lacking axil ring pigmentation, allelic to d; FN1218/6 lacking axil ring pigmentation, allelic to d; FN2026/7coch2 candidate; FN2073/5 lacking axil ring pigmentation, not allelic to d; and FN2076/5VicA FN deletion line.

Crosses were made between pairs of lines lacking axil ring pigmentation (Supplementary Fig. 32) to test for complementation. Where possible, vegetative phenotypes were used to identify F1 progeny plants, and those accessions allelic, or non-allelic, to d were identified by the absence, or presence of pigmented axil rings, respectively.

Complementation test

A reverse genetics screen for the ChlG gene in pea was carried out in an ethane methane sulfonate-mutagenised targeting induced local lesions in genomes (TILLING) population in background Caméor26. Line 411.1, with a G>A mutation 1,900 bp after the ATG, resulting in a W121* nonsense mutation, was identified. Eight M4 seeds were sown and seedlings were genotyped with a cleaved amplified polymorphic sequence (CAPS) marker (Supplementary Table 47). No seedlings were homozygous mutants (signified by a single undigested 1,125-bp band), 6 were heterozygous (signified by 3 bands of sizes 1,125 bp, 699 bp and 426 bp) and 2 were homozygous wild type (signified by 2 bands of sizes 699 bp and 426 bp). A complementation test was carried out by crossing heterozygous seedlings with a homozygous JI0015 gp mutant (13 crosses with male JI0015 and F1 identified by long internodes, and 3 crosses with female JI0015 and F1 identified by yellow cotyledons). Nine out of 16 F1 progeny plants had yellow pods, indicating non-complementation.

Virus-induced gene silencing

VIGS in peas was performed using a published methodology as described previously77. To target genes of interest, a 200–500 bp fragment from the CDS region of each gene were amplified. The primers for VIGS constructs, including VIGS-PsChlG, VIGS-PsOs1, VIGS-PsMYB26 and PsMYB16, are provided in Supplementary Table 47. SpeI/XbaI and EcoRI were used to linearize the pCAPE2 vector, which was kindly provided by Li et al.78, and corresponding fragments of gene targets were ligated into the vector to construct the vectors for VIGS assays. For VIGS-PsChlG, the negative control vector, pCAPE2-Con, was constructed in the same way by replacing the PsChlG fragment in pCAPE2-PsChlG with a 529-bp insert derived from a cDNA fragment of Bean Yellow Mosaic Virus (GenBank accession AJ622899). The positive control vector, pCAPE2-PDS, targeting the phytoene desaturase gene, was also provided by Li et al.78. These vectors were transferred into Agrobacterium tumefaciens (GV3101) (Shanghai Weidi Biotechnology) and VIGS assays carried out following the protocol described by Constantin et al.79. In brief, Agrobacterium strains carrying these vectors were shaken separately until OD600 = 1.2, followed by the collection and resuspension of the bacteria in injection buffer (NaCl: 10 mM, CaCl2: 10 mM, acetosyringone: 0.1 mM) to a concentration of OD600 = 1.2. After resting for 2–3 h, the solution of pCAPE2-target gene, pCAPE2-PDS (positive control), and pCAPE2-Con (negative control) was mixed with pCAPE1, separately, in equal proportions, and injected into 10-day-old compound leaves of the acceptant lines (Yunnan2070 or JI1995). Specifically, pCAPE1 and pCAPE2 are plasmid vectors used to induce gene silencing in plants such as M. truncatula and P. sativum. After 24 h of darkness, they were transferred to long day conditions. New leaves of positive control plants bleached in about 10 days, indicating successful silencing of PDS. VIGS was employed with the same procedure for PsMYB16 gene within the D locus and VIGS-MYB26 for the V candidate gene. For PsOs1, which is described in detail below, all the gene-specific primers used for VIGS constructs are listed in Supplementary Table 47.

Transformation, gene overexpression and silencing of PsOs1

The PsOs1 coding sequence of JI0074 was amplified (primers listed in Supplementary Table 47) and integrated into the pCAMBIA1305 vector, resulting in the pCAMBIA1305-PsOs1JI0074 construct. The plasmid was then introduced into A. tumefaciens GV3101, which was subsequently employed to transform A. thaliana (Col-0) via the floral dip technique. T3 generation homozygous transgenic Arabidopsis lines were selected for measurement of thousand-seed weight and the dimensions of elongated siliques.

GUS staining, GFP fluorescence observations and flow cytometry

The pCAMBIA1305-PsOs1JI1995 vector was constructed using the same methodology, with primers detailed in Supplementary Table 47. Both vectors, pCAMBIA1305-PsOs1JI1995 and pCAMBIA1305-PsOs1JI0074, were introduced into the A. tumefaciens strain GV3101. In these experiments, H2B-mCherry served as a nucleus marker. The agrobacteria were resuspended and infiltrated into Nicotiana benthamiana leaf epidermal cells using an infiltration buffer consisting of 10 mM MES (pH 5.6), 10 mM MgCl2, and 150 μM acetosyringone, at an OD600 of 0.8. Fluorescence was observed 48 h after infiltration using a confocal laser-scanning microscope.

To compare the promoter activities of JI0074 and JI1995 alleles of PsOs1, we cloned sequences 3,000-bp upstream of the coding region and inserted them into pCAMBIA1300-GUS, resulting in the constructs ProJI0074-GUS and ProJI1995-GUS. These were expressed in Nicotiana tabacum leaves and subsequently stained using a GUS Staining Kit (Coolaber Biotech). GUS activity was quantified using the GUS Gene Quantitative Detection Kit (Coolaber Biotech). For a detailed examination of PsOs1 expression patterns in Arabidopsis, various Arabidopsis tissues were sampled from ProJI0074-GUS transgenic plants. After ethanol decolourization, observations and photographs were taken under a microscope. Details of the primers used are provided in Supplementary Table 47.

Intact nuclei from pea pods were isolated using LB01 lysis buffer (Coolaber Biotech), followed by RNA removal and subsequent propidium iodide staining. The nuclei were then quantified using a CytoFLEX flow cytometer. A minimum of 20,000 nuclei were counted for each sample, and each experiment was replicated at least three times. Data analysis was conducted using FLOWJO software, and representative images were presented. The endoreduplication index (EI) was calculated using the formula: EI = [(0 × percentage of 2C nuclei) + (1 × percentage of 4C nuclei) + (2 × percentage of 8C muclei) + (3 × percentage of 16C nuclei) + (4 × percentage of 32C nuclei)]/100.

Yeast two-hybrid experiment

Yeast two-hybrid assays were conducted according to the protocols outlined in the Yeast Protocols Handbook (Clontech). The CDS of PsCIK2/3 was cloned into the bait plasmid pBT3-SUC, while the CDS of PsCLV1 or PsCLV2 was cloned into the prey plasmid pPR3-N. The primer sequences used for cloning are provided in Supplementary Table 47. These plasmids were co-transformed into the yeast strain NMY51 in different combinations. Transformants were initially screened on SD/-Trp/-Leu medium to confirm successful co-transformation. Interaction assays were then performed on SD/-Trp/-Leu/-His/-Ade medium containing the chromogenic substrate X-α-Gal at 30 °C to detect protein-protein interactions.

Anatomical studies and TEM

Confocal images were collected with Leica TCS SP8 confocal laser-scanning microscope (Leica). After sampling, the shoot apices of Caméor and fa mutant line JI0814, the young leaves (gp/gp, JI2366; Gp/Gp JI0817), and the pod walls of JI0074 and JI1995 were immediately preserved in formaldehyde/alcohol/acetic acid fixative. Paraffin sectioning was performed following established methodologies. Staining was conducted using safranin and fast green (JI0074 and JI1995) and toluidine blue (Caméor and JI0814). Prepared slides were scanned using a NanoZoomer, and cell quantification was carried out using NDP.view2 software. For pods, the resin block was sliced at 60–80 nm on an ultrathin slicer, and the slices were picked up on a 150-mesh copper mesh. The copper mesh was stained with a 2% uranyl acetate saturated alcohol solution in the dark for 8 min; washed 3 times with 70% alcohol; washed 3 times with ultrapure water; stained with a 2.6% lead citrate solution in the dark for 8 min; washed 3 times with ultrapure water, and slightly dried with filter paper. The copper mesh sections were placed in a copper mesh box and dried overnight at room temperature. The observation was under a transmission electron microscope and images collected for analysis.

For TEM studies, pea leaflets and pods (18 days after flowering) were removed from BC3 S2 gp/gp and Gp/Gp plants, after 9 h of daylight. Tissue (1 mm2) pieces were placed in a solution of 2.5% (v/v) glutaraldehyde in 0.05 M sodium cacodylate, pH 7.3 for fixation. Samples were left overnight at room temperature and then processed for embedding (Leica EM TP embedding machine) by washing out the fixative with three successive 15-min washes in 0.05 M sodium cacodylate, followed by fixation in 1% (w/v) OsO4 in 0.05 M sodium cacodylate for 2 h at room temperature. After three 15 min washes in distilled water, samples were dehydrated in an ethanol series (30%, 50%, 70%, 95% and two changes of 100% ethanol), then infiltrated with LR White resin (London Resin Company) by successive changes of resin:ethanol mixes at room temperature (1:1 for 1 h, 2:1 for 1 h, 3:1 for 1 h, 100% resin for 1 h, then 100% resin for 16 h, and 100% resin for a further 8 h). Samples were polymerized in LR White resin at 60 °C for 16 h, then sectioned with a diamond knife (Leica UC7 ultramicrotome). Ultrathin sections (approximately 90 nm) were placed on 200 mesh formvar and carbon-coated copper grids (Agar Scientific). Sections were stained with 2% (w/v) uranyl acetate for 1 h and 1% (w/v) lead citrate for 1 min, washed in distilled water and air dried. Grids were viewed in a FEI Talos 200 C transmission electron microscope (FEI) at 200 kV and imaged using a Gatan OneView 4K × 4K digital camera (Gatan) to record DM4 files.

In situ hybridization

Tissues were rinsed with PBS and immediately placed in the in situ hybridization fixative solution for more than 12 h. Paraffin section preparation and in situ hybridization of the probes were performed according to standard protocols80. The sequences of the digoxigenin (DIG)-labelled antisense riboprobes used for in situ hybridization are provided in Supplementary Table 47. The hybridization signal was detected with an alkaline phosphatase-conjugated anti-DIG antibody (200-052-156, Jackson ImmunoResearch). Finally, images were obtained and analysed using a Pannoramic MIDI digital slide scanner (3DHISTECH).

RNA-seq, iso-seq and gene expression

RNA extraction and pea transcriptome

We built a transcriptome atlas from the reference line Caméor (Supplementary Table 30), and selected various accessions that display the contrasting pairs of traits studied here (Supplementary Table 31). In China, plant tissues (seed, root, nodule, leaflet, stem, flower, pod, stipule, tendril and apical bud) at different development stages (seedling, flowering and podding) were collected and fixed in Trizol before RNA extraction. Tissues were ground in liquid nitrogen and the FastPure Universal Plant Total RNA Isolation Kit (Vazyme Biotech) was used to extract total RNA, the quality of which was assessed by gel electrophoresis. For each sample, we performed short-read RNA-sequencing using the DNBSEQ Platform at BGI Group to generate 6–8 Gb raw RNA reads for each accession.

At JIC, RNA was prepared from young developing pods (flat pod stage, ~60–70 mm in length) of each of the parental and RI lines derived from the cross between JI0015 (gp/gp) and JI0399 (Gp/Gp). Developing seeds were removed from the pods, which were then rapidly frozen in liquid nitrogen. High-quality RNA lacking genomic DNA was extracted from 97 individual pod samples, using a Spectrum Plant Total RNA Kit (Sigma-Aldrich), and used for PCR with reverse transcription and RNA-seq experiments focussed on the identification and characterisation of gene candidates for gp. For the latter analysis, green-podded and yellow-podded RILs (95 in total) were assigned to three groups for each phenotype, ensuring that lines with contrasting plant phenotypes (e.g. plant height) were randomly distributed among the replicate groups (G1, G2 and G3 for green-podded RILs; Y1, Y2 and Y3 for yellow-podded RILs, with 15–17 RILs per pool). Equal amounts of RNA from every line within a group were pooled. RNA-seq (Illumina HiSeq4000) and initial bioinformatic analyses were carried out by the Earlham Institute.

We performed Iso-seq sequencing for a subset of accessions for the target organ at specific developmental stages. We used the Iso-Seq (v.4.0.0) pipeline to process PacBio SMRT Cell subreads and generate high-quality, full-length transcripts. First, subreads from each SMRT Cell were processed with ccs (v.3.4.1) to produce one circular consensus sequence per zero-mode waveguide, applying a minimum read quality of 0.9. Next, primer removal and demultiplexing were performed with lima (v.2.9.0) in Iso-Seq mode, removing unwanted primer combinations and orienting reads from 5′ to 3′. Full-length reads were then refined by trimming poly(A) tails and removing concatemers. When multiple SMRT cells were sequenced, the resulting full-length non-concatemer (FLNC) BAM files were merged before clustering. Iso-Seq cluster (v.4.0.0) was applied to produce polished consensus transcripts, partitioned into high-quality (HQ) and low-quality (LQ) sets on the basis of predicted accuracy. The final consensus transcripts were aligned to the ZW6 and JI0074 reference genomes using pbmm2 (v.1.14.99). Last, iso-seq collapse was used to collapse redundant isoforms and generate GFF files, which were converted to GFF3 with gffread (0.12.7) for downstream analyses.

RT–qPCR

RT–qPCR was conducted to analyse gene expression. Total RNA was extracted using the FastPure Plant Total RNA Isolation Kit (Vazyme Biotech) following the manufacturer’s instructions, including an on-column DNase I digestion step to remove genomic DNA. Subsequently, 1 μg of RNA was used for cDNA synthesis with the All-in-One First-Strand cDNA Synthesis SuperMix for qPCR kit (TransGen Biotech). Green qPCR SuperMix kit (TransGen Biotech) was used for amplification on a CFX384TM Real-Time System (Bio-Rad). The method of collecting plant material for the detection of P/PsCLE41 and V/PsMYB26 was as follows: (1) pods were collected 10 days after flowering. After removing the seeds, the pods were cut into pieces and frozen in liquid nitrogen; (2) apical buds were collected 14 days after emergence and, after removing the extra young leaves under a microscope, the buds were cut into pieces and frozen in liquid nitrogen; (3) stems were collected 14 days after emergence at the third node from the top of the stem, cut into pieces and frozen in liquid nitrogen; (4) seed cotyledons were collected 12 days after flowering; after removing the testa, the cotyledons were cut into pieces and frozen in liquid nitrogen.

To validate the Gp transcriptional fusion and aberrant transcripts, the total RNA was reverse transcribed to cDNA using HiScript III First Strand cDNA Synthesis Kit (+gDNA wiper, Vazyme Biotech). RT–qPCR analysis was conducted using Taq Pro Universal SYBR qPCR Master Mix (Vazyme Biotech), employing specific primers, with PsACTIN serving as the internal standard. Expression levels of genes were quantified relative to the control based using the 2−ΔΔCT method. All RT–qPCR results represent the mean ± s.d. from three separate biological experiments. The primers used for RT–qPCR primers used are provided in Supplementary Table 47.

DNA methylation sequencing

Bisulfite treatment and libraries were prepared accordingly to the standard protocol of methylation library construction. PE150 sequencing was performed using Illumina NovaSeq X Plus sequencing platform 25B chip. Quality control and adapter trimming of the raw whole-genome bisulfite sequencing (WGBS) data was then performed using Trimmomatic (v.0.39). The resulting WGBS reads were mapped to the JI0074 reference genome using Bismark (v.0.23.0) and PCR duplicates were removed from the aligned reads.

Statistical methods

General statistical analysis

Statistical analyses were conducted in R software suite (v.4.2; https://www.r-project.org/) unless otherwise stated. The correlation between different traits was tested by calculating the coefficients of Pearson correlation, as well as the P values, using the cor.test package, with the method set to ‘Pearson’ for the correlation analyses between quantitative traits. Traits collected at different locations and in different years were analysed by calculating their rank correlations by setting the option ‘method’ to ‘Spearman’. The correlation between qualitative traits was assessed using the chi-square test using the ‘chisq.test’ package in R. Gene expression levels in different lines or organs under different treatments or at different developmental stages (at least three biological replicates for each sample) were analysed using DESeq2 (ref. 81), in which the genes with a false discovery rate (Bonferroni) lower than 0.01 were defined as significantly regulated genes, unless there is an alternative explanation in a specific legend. The min–max scaling (normalization) approach was used to calculate the expression level for comparison of each gene across stages and organs by the formula: Xscaled = (X − Xmin)/(Xmax − Xmin).

PCA (main text Fig. 1) was performed on the PLINK distance matrix using an Excel add-in downloaded from RIKEN, now available at https://systemsomicslab.github.io/compms/others/main.html#Statistics.

Statistics and reproducibility

All experiments were designed with explicit consideration of statistical power and reproducibility. Each experiment was independently repeated at least three times (biological replicates) with consistent results across repetitions, including microscopy analyses used at least three samples or technical replicates per experimental condition (such as Fig. 3b,c,g and Extended Data Figs. 4a–c,e and 6b,h,i); phenotypic observations were validated across ≥3 independent growth cycles (Norwich, Shenzhen, Harbin, multiple sites and multiple years). Micrographs shown are representative of at least three independent experiments. Complete protocols and raw data supporting the conclusions are available in the Data availability and Supplementary Information. Data are presented as mean ± s.e.m., and statistical significance was determined using two-tailed Students’ t-tests (‘t.test’ package in R software v.4.2) in the analyses of the phenotypes, such as seed weight and pod width, between different accessions.

Population structure analysis

The core high-quality SNP dataset was used for population structural analyses. PCA and t-distributed stochastic neighbour embedding analyses were first performed using beta Python modules sklearn.decomposition and sklearn.manifold. ADMIXTURE82 (version 1.3.0) was employed to analyse the population structure, with K increasing from 2 to 16. The Q value represents the estimated proportion of an individual’s genome that originates from each inferred ancestral population cluster.

Genetic differentiation (Fst) and nucleotide diversity (π) were calculated with VCFtools (version 0.1.13). Fst scores were calculated within nonoverlapping 100-kb windows and π was calculated for each individual site and averaged across the genome for each group. LD was calculated on SNP pairs within a 500-kb window using PopLDdecay83 (version 3.31; https://github.com/BGI-shenzhen/PopLDdecay) and the decay was measured by the distance at which the Pearson’s correlation efficient (r2) dropped to half of the maximum. Splits tree analysis of the PLINK distance matrix was performed using SplitsTree4 (ref. 61).

Germplasm availability

All the germplasm described and used in this work is available to order from the John Innes Centre Germplasm Resources Unit (https://www.seedstor.ac.uk/). The 697 sequenced single seed descent lines derived from John Innes Pisum Germplasm accessions, were imported in 2019 to the Agricultural Genomics Institute at Shenzhen (AGIS), Chinese Academy of Agricultural Sciences (CAAS), where they have been grown and phenotyped annually since then.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.