Introduction

Phenylpropanoids comprise a diverse group of compounds involved mainly in plant defense, structural support, survival and adaptation to environmental perturbations1,2. They also protect the plant against UV radiation, herbivores and pathogens, and mediate plant-pollinator interactions by producing different floral pigments and scented products2,3. Their biosynthesis occurs through the phenylpropanoid pathway. In the core phenylpropanoid pathway, phenylalanine is converted into activated hydroxycinnamic acid derivatives via the sequential action of the enzymes PAL (phenylalanine ammonia-lyase), C4H (cinnamate 4-hydroxylase) and 4CL (4-coumarate: CoA ligase) (Fig. 1). The end product of this pathway acts as a precursor for the biosynthesis of various secondary metabolites such as lignins, coumarins, benzoic acids, stilbenes and flavonoids1,3. Thus, this pathway originates from phenylalanine and leads to the synthesis of a large class of phytochemicals.

Fig. 1

Flowchart depicting the phenylpropanoid pathway with positions of PAL: phenylalanine ammonia-lyase; C4H: cinnamate 4-hydroxylase and 4CL: 4-coumarate: CoA ligase enzymes in the pathway.

In the first step of the pathway, phenylalanine is converted into trans-cinnamic acid by PAL via non-oxidative deamination2,4,5. This step channels the flow of carbon from primary metabolism into secondary metabolism, thereby interconnecting these two physiological processes in plants6. PAL is present in all plants and in some fungi and bacteria, but not in animals7. The first PAL was identified in Hordeum vulgare4. Since then, researchers have studied the PAL genes encoding this enzyme in numerous plant species such as Citrullus lanatus8, Eucalyptus grandis9, Malus domestica10 and Camellia sinensis11. Numerous studies have consistently demonstrated that PAL genes are stress-responsive. They are activated by a range of environmental factors, including UV radiation12, pathogen infections12,13, tissue injury13, extreme temperatures14, nutrient depletion15, long-term phosphate starvation16, and salinity and water stress17.

In the second step of the phenylpropanoid pathway, C4H catalyzes the hydroxylation of cinnamic acid (cinnamate), yielding p-coumaric acid (4-coumarate)18. C4H, a cytochrome P450-dependent monooxygenase, was initially discovered in 1967 in pea seedlings19. Later, C4H was studied in various model plants such as Oryza sativa20 and Arabidopsis thaliana5,21. C4H proteins have been divided into two groups, C4H class I and C4H class II, wherein the class I members play a major role in lignin biosynthesis while class II members have been associated with stress responses in plants18,21,22. Additionally, the expression of C4H genes in various tissues during different growth stages has been profiled in Populus tremuloides23, Populus trichocarpa20, Leucaena leucocephala24, Dryopteris fragrans25, and Eucalyptus grandis9. Further, a change in the expression of the C4H gene has been observed in Morus notabilis in response to heavy metal stress26 and in Camellia sinensis in response to wounding and abiotic stress conditions27,28, highlighting the stress-responsive nature of C4H genes.

The last step of the pathway is the formation of p-coumaroyl CoA from p-coumaric acid, catalyzed by 4-coumarate: CoA ligase (4CL). As with C4H, the 4CL enzymes have been divided into classes on the basis of evolutionary analysis, namely class I, class II and class III29,30. 4CL directs the flow of carbon from the core phenylpropanoid pathway into the biosynthesis of numerous phenylpropanoid-derived compounds29. The first 4CL gene was cloned from Petroselinum crispum31. 4CL genes have been characterized in a variety of plant species such as Glycine max32, Panicum virgatum33, Populus tomentosa34, Boehmeria nivea35, and Citrus sinensis36. The expression patterns of 4CL genes have been investigated in various tissues of several plants, during different stages of development29,35,37,38,39,40,41, and in response to various triggers such as elicitors/phytohormones42, abiotic stress43,44, and UV exposure39.

The number of members in the PAL, C4H and 4CL gene family varies considerably amongst different plants (Table 1). In the PAL gene family, 17 genes were found in Brassica napus which was the maximum number in this gene family45. In the case of the C4H gene family, a single C4H member was identified in Arabidopsis thaliana while Brassica napus showed the presence of 10 C4H members5,21,46. In a similar manner, the highest number of 4CL genes have been reported in Malus domestica which showed the presence of 69 genes47. Thus, variations in the numerical strength of these gene families suggested that different members may play a role in production of several kinds of phenylpropanoids in plants. Moreover, different expression patterns in a variety of vegetative and reproductive tissues of PAL, C4H and 4CL genes depicted their role in plant growth and development.

Table 1 Quantitative strength of PAL, C4H and 4CL gene family across various plant species.

Orchids are one of the largest families of flowering plants and harbour a wide variety of bioactive compounds known for their therapeutic importance90,91,92. One such orchid, which forms the basis of a multibillion-dollar market, is Vanilla planifolia. Its highly valued phytochemical, vanillin, has a wide range of applications, such as use as a flavor and fragrance ingredient in ice creams, confectionery, milk products and perfumes. Vanillin is also exploited for its multiple therapeutic properties, namely anticancer and neuroprotective activity93,94,95. A byproduct (a C6-C3 phenylpropanoid) of the phenylpropanoid pathway serves as a precursor in vanillin biosynthesis96. Additionally, reports have demonstrated a positive correlation between the upregulated expression of the PAL and C4H genes and the accumulation of vanillin in Vanilla planifolia97,98. Hence, the genome of Vanilla planifolia was sourced from NCBI99, and the PAL, C4H and 4CL genes of the phenylpropanoid pathway were identified and subjected to in silico characterization. The detailed study of these pivotal genes of the phenylpropanoid pathway will lay the groundwork for investigating their function in the production of vanillin in V. planifolia. In addition, it could also serve as a template for studying these genes in other plants and lead to the identification and characterization of novel bioactive compounds through varied biotechnological approaches.

Materials and methods

Identification of PAL, C4H and 4CL proteins

For the current analysis, genome data of Vanilla planifolia [Accession: PRJNA753216; https://www.ncbi.nlm.nih.gov/bioproject/PRJNA753216/] were taken from the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/). BLASTp searches were conducted using protein sequences of PALs16,49, C4Hs20,21 and 4CLs50,73 of Arabidopsis thaliana and Oryza sativa as queries against the protein sequences of V. planifolia in NCBI, keeping all the parameters at their default values.

Analysis of conserved domains, motifs and multiple sequence alignment

The identification of protein sequences in Vanilla planifolia was confirmed by checking the presence of conserved domain, Lyase_aromatic (PF00221) for PAL, p450 (PF00067) for C4H and AMP-binding (PF00501) and AMP-binding_C (PF13193) domains for 4CL using SMART server (http://smart.embl-heidelberg.de/)100. Further, the protein architecture of those proteins that had the conserved domain present was built using the My Domains – Image Creator tool in the Expasy PROSITE server (https://prosite.expasy.org/)101. Conserved motifs and their location within the identified PAL, C4H and 4CL Vanilla planifolia proteins along with their counterparts in Arabidopsis thaliana and Oryza sativa were predicted by using Multiple Expectation Maximization for Motif Elicitation (MEME) Suite (version 5.5.5) (https://meme-suite.org/meme/)102. Any number of repetitions for a motif was allowed in the analysis and the maximum number of motifs to be detected was selected as 10. The minimum and the maximum motif width were set as 6 and 50, respectively, keeping all other parameters at the program’s default values. In addition, multiple sequence alignment was also performed by aligning the protein sequences of Vanilla planifolia, Arabidopsis thaliana and Oryza sativa using MultAlin (http://multalin.toulouse.inra.fr/multalin/)103 for further confirmation of the identified proteins by checking the presence of conserved regions.
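The domain-based confirmation step described above can be pictured as a simple set-membership check. The following is a minimal sketch only: the Pfam accessions come from the text, while the candidate identifiers and their domain annotations are hypothetical placeholders and do not represent actual SMART output.

```python
# Pfam domains required for each family, as stated in the text.
REQUIRED_DOMAINS = {
    "PAL": {"PF00221"},             # Lyase_aromatic
    "C4H": {"PF00067"},             # p450
    "4CL": {"PF00501", "PF13193"},  # AMP-binding and AMP-binding_C
}


def passes_domain_check(family, domains_found):
    """Return True if every domain required for the family was detected."""
    return REQUIRED_DOMAINS[family] <= set(domains_found)


# Hypothetical candidates: (family, Pfam domains reported for the sequence).
candidates = {
    "cand_1": ("4CL", ["PF00501", "PF13193"]),
    "cand_2": ("4CL", ["PF00501"]),  # lacks AMP-binding_C, so it is dropped
    "cand_3": ("PAL", ["PF00221"]),
}

kept = [
    name
    for name, (family, domains) in candidates.items()
    if passes_domain_check(family, domains)
]
print(kept)
```

A candidate is retained only when all required domains are present, which mirrors the requirement that 4CL sequences carry both AMP-binding domains.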

In silico prediction of physicochemical properties

The sequence length (total amino acids), molecular weight, isoelectric point, instability index, aliphatic index and GRAVY (grand average of hydropathicity index) value of Vanilla planifolia proteins were predicted using PROTPARAM (https://web.expasy.org/protparam/)104 and the sub-cellular localization was predicted by using Plant-mPLoc server (http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/)105.
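For readers unfamiliar with two of the indices listed above, the sketch below shows how GRAVY and the aliphatic index are conventionally computed from a sequence. It uses the standard Kyte-Doolittle hydropathy scale and the usual aliphatic index coefficients (2.9 for valine, 3.9 for isoleucine and leucine); the function names and the toy peptide are illustrative assumptions, and the code is not a substitute for the ProtParam server used in the study.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy value per residue."""
    seq = seq.upper()
    return sum(KD[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: 100 * (xA + 2.9*xV + 3.9*(xI + xL)), x = mole fraction."""
    seq = seq.upper()
    n = len(seq)
    x_ala = seq.count("A") / n
    x_val = seq.count("V") / n
    x_ile_leu = (seq.count("I") + seq.count("L")) / n
    return 100.0 * (x_ala + 2.9 * x_val + 3.9 * x_ile_leu)


toy_peptide = "MAVLIK"  # hypothetical sequence, for illustration only
print(round(gravy(toy_peptide), 2), round(aliphatic_index(toy_peptide), 1))
```

A negative GRAVY value indicates an overall hydrophilic protein and a positive value an overall hydrophobic one, which is how the values are interpreted in the Results.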

Secondary structure prediction

The secondary structure of the Vanilla planifolia proteins depicting the percentages of alpha helices, extended strands, beta turns and random coils in the proteins was predicted by using the SOPMA online server (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html)106.
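The percentages reported by such a predictor amount to a per-residue tally. A minimal sketch follows, assuming the prediction is available as one state letter per residue with h, e, t and c standing for alpha helix, extended strand, beta turn and random coil; this encoding and the example string are assumptions made for illustration.

```python
from collections import Counter

# Assumed one-letter state codes for the four predicted classes.
STATE_NAMES = {
    "h": "alpha helix",
    "e": "extended strand",
    "t": "beta turn",
    "c": "random coil",
}


def composition(states):
    """Return the percentage of residues assigned to each of the four states."""
    counts = Counter(states.lower())
    total = sum(counts[code] for code in STATE_NAMES)
    if total == 0:
        raise ValueError("no recognised secondary structure states")
    return {name: 100.0 * counts[code] / total for code, name in STATE_NAMES.items()}


print(composition("hhhhcccctE"))  # hypothetical 10-residue prediction
```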

Phylogenetic analysis

The full-length amino acid sequences of PAL proteins of Vanilla planifolia were aligned with the PAL proteins of Cephalotaxus hainanensis56, Arabidopsis thaliana21, Apostasia shenzhenica48, Dendrobium catenatum48, Phalaenopsis aphrodite48, Phalaenopsis bellina48, Phalaenopsis equestris48, Phalaenopsis lueddemanniana48, Phalaenopsis modesta48, Phalaenopsis schilleriana48, Sorghum bicolor87 and Oryza sativa16 using Muscle software with gaps. Then, the phylogenetic tree was constructed using the neighbor-joining method with pairwise deletion and 1000 bootstrap replicates in Molecular Evolutionary Genetics Analysis (MEGA-XI)107. In a similar manner, using the same parameters, a phylogenetic tree was constructed for C4H proteins of Vanilla planifolia along with the C4H proteins of Arabidopsis thaliana21, Camellia sinensis28, Fagopyrum tataricum63 and Oryza sativa20. Further, to explore the evolutionary relationships amongst the identified Vpl4CL proteins and the 4CLs of A. thaliana50, Panicum virgatum33, Scutellaria baicalensis84, Solanum tuberosum86 and O. sativa73, a phylogenetic analysis was also carried out.
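To illustrate what pairwise deletion means in this setting, the sketch below computes a distance between two aligned sequences while skipping only those columns in which either member of the pair has a gap. The substitution model used for the trees is not stated in the text, so the simple p-distance (proportion of differing sites) is an assumption chosen for illustration, and the sequences are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored for this pair only
    (pairwise deletion), rather than being removed from the whole alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # skip this column for this pair only
        compared += 1
        if res_a != res_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Hypothetical aligned fragments: one gapped column, one substitution.
print(p_distance("MKV-LA", "MRVTLA"))
```

A matrix of such pairwise distances is the input that a neighbor-joining procedure works from.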

Gene structure analysis

The genomic and CDS (coding) sequences of Vanilla planifolia genes were retrieved from the NCBI database. The gene architecture was constructed to analyze the distribution of exons and introns for each gene by comparing the genomic and coding sequences using Gene Structure Display Server (GSDS v.2.0) (http://gsds.gao-lab.org/)108.
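Intron phase, one of the features displayed in exon-intron diagrams, follows directly from the coding lengths of the exons. A minimal sketch is given below, assuming the input is the list of coding-portion lengths of each exon (untranslated regions excluded) in transcript order; the function name and the example lengths are hypothetical.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron.

    The phase is the number of coding bases accumulated before the intron,
    modulo 3: phase 0 falls between codons, phase 1 after the first base of
    a codon and phase 2 after the second base.
    """
    phases = []
    coding_bases = 0
    for length in cds_exon_lengths[:-1]:  # no intron follows the last exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


print(intron_phases([100, 50, 30]))  # hypothetical three-exon gene
```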

Promoter analysis

Promoter region sequences, 1.5 kb upstream from the start codon were retrieved from the NCBI database. Later on, the obtained promoter sequences of the genes were subjected to the PlantCARE database (https://bioinformatics.psb.ugent.be/webtools/plantcare/html/)109 for predicting the type and position of cis-regulatory elements.
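Retrieving an upstream region requires attention to strand, because for a gene on the reverse strand the promoter lies at higher forward coordinates and must be reverse-complemented. The sketch below is a minimal illustration under stated assumptions: the start codon is given as a 0-based, half-open interval on the forward strand, and the function names and toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, codon_start, codon_end, strand, length=1500):
    """Return up to `length` bases upstream of the start codon, read 5'->3'.

    codon_start/codon_end: 0-based half-open interval of the start codon on
    the forward strand. strand: '+' or '-'.
    """
    if strand == "+":
        return chrom_seq[max(0, codon_start - length):codon_start]
    if strand == "-":
        return reverse_complement(chrom_seq[codon_end:codon_end + length])
    raise ValueError("strand must be '+' or '-'")


# Toy examples with a 4-base window instead of 1.5 kb.
print(upstream_region("GGGTTTATGCCC", 6, 9, "+", length=4))
print(upstream_region("CCCCATAAAGG", 3, 6, "-", length=4))
```

The region is truncated silently when the gene lies closer than `length` bases to the end of the sequence, which is a design choice of this sketch.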

Gene expression analysis

The CDS (coding sequences) of V. planifolia genes were used to perform an NCBI Sequence Read Archive (SRA) BLASTn search against the RNA sequence data derived from the tissues of Vanilla planifolia [stem (SRX11714032), leaf (SRX11714030), soil root (SRX11714033), aerial root (SRX11714034), flower bud (SRX11714036), flower (SRX11714031), ovary (SRX11714037), and fruit (SRX11714029)]. These sequence data are part of the NCBI BioProject (accession PRJNA753216)99. The number of hits was counted and the RPKM values were subsequently estimated as $(C \times 10^{9})/(N \times L)$, where C denotes the number of hits corresponding to a particular sequence, N denotes the total number of reads in that specific RNA-seq experiment and L denotes the length of the CDS sequence for the particular candidate gene110. Heatmaps were generated individually for VplPAL, VplC4H and Vpl4CL genes using the ClustVis visualization tool (https://biit.cs.ut.ee/clustvis/)111 in order to study the relative expression of these genes in different tissues.
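The RPKM estimate defined above, with C hits, N total reads and a CDS of L bases, can be written as a one-line function. This is a minimal sketch; the function name and the example numbers are hypothetical and are not values from the study.

```python
def rpkm(hits, total_reads, cds_length_bp):
    """Reads per kilobase per million mapped reads: (C * 10^9) / (N * L)."""
    if total_reads <= 0 or cds_length_bp <= 0:
        raise ValueError("total_reads and cds_length_bp must be positive")
    return hits * 1e9 / (total_reads * cds_length_bp)


# Hypothetical example: 500 hits, 20 million reads, 2,000 bp CDS.
print(rpkm(500, 20_000_000, 2_000))
```

The factor of 10^9 combines the per-kilobase (10^3) and per-million-reads (10^6) scalings, so values are comparable across genes of different lengths and libraries of different depths.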

Results

Identification of PAL, C4H and 4CL proteins and conserved domain & motif analysis

A BLASTp search using Arabidopsis thaliana and Oryza sativa PAL, C4H and 4CL protein sequences predicted a total of 16, 8 and 9 sequences of the PAL, C4H and 4CL gene families, respectively, in Vanilla planifolia based on query coverage, percentage identity, alignment score and e-value. Domain analysis showed that Lyase_aromatic (PF00221), the conserved domain of PAL proteins, was absent in eight sequences, and two members had a partial sequence; these sequences were excluded from further analysis. Thus, the remaining six PAL sequences were considered for further characterization. Similarly, two non-redundant and complete C4H proteins were identified in V. planifolia by checking the presence of the conserved domain, p450 (PF00067). Likewise, only five 4CL sequences showed the presence of both the AMP-binding (PF00501) and AMP-binding_C (PF13193) domains. Further, domain architecture showed the presence and position of the conserved domains in all the proteins of V. planifolia (Supplementary Table S1, Figs. 2a, 3a and 4a). A total of 10 conserved motifs (marked as 1–10) were predicted in VplPAL, VplC4H and Vpl4CL proteins. The Lyase_aromatic domain of PAL proteins was represented by motifs 1, 2, 3, 4 and 8, which were present in all the PAL members of V. planifolia, A. thaliana and O. sativa. These motifs also contained the catalytically essential residues of PAL proteins (Fig. 2b). For C4H proteins, motif 4, which represented the p450 domain and contained the hinge motif sequence (PPGP), was present in all the C4H proteins (Fig. 3b). In the case of 4CL proteins, motifs 1, 2, 4, 6 and 8 lay within the AMP-binding domain and were present in all the proteins except Vpl4CL5, in which motif 6 was absent. Further, motifs 1 and 2 contained the Box II sequence, and the substrate-binding residues resided in motifs 2, 4 and 8. Motif 5, which was present in all the proteins, represented the AMP-binding_C domain and contained the residue essential for the enzymatic function of 4CL (Fig. 4b).

Fig. 2

VplPAL proteins showing the position of conserved domains and motifs.

Fig. 3

VplC4H proteins showing the position of conserved domains and motifs.

Fig. 4

Vpl4CL proteins showing the position of conserved domains and motifs.

Multiple sequence alignment

Multiple sequence alignment of VplPAL1-6 proteins along with their counterparts in Arabidopsis thaliana and Oryza sativa showed the presence of five conserved domains [N-terminal domain, MIO (4-methylidene-imidazolone-5-one) domain, core domain, shielding domain and C-terminal domain], that were characterized in all the identified PALs through MultAlin. A high degree of sequence conservation was depicted in all these domains except for some sequence divergence in the N-terminal domain. All the PAL proteins depicted that the three amino acid residues ‘Ala-Ser-Gly’ (ASG), playing a significant role in substrate binding and catalysis of the MIO-domain, were completely conserved. In addition, the ‘FL’ residue known for imparting substrate specificity to the PAL enzymes was also seen to be conserved. Alongside these conserved residues, other catalytically active sites such as GLALVNG, NDN, and HNQD were predominantly conserved in the majority of PAL proteins (Fig. 5).

Fig. 5

Multiple sequence alignment showing the conserved regions along with the catalytically essential active residues of PAL proteins.

The alignment of the protein sequences of VplC4H1 and VplC4H2 along with C4H proteins of Arabidopsis thaliana and Oryza sativa showed that all the residues related to C4H activity such as substrate recognition sites, ERR triad, heme-iron binding domain, hinge motif and enzymatic active sites were conserved. A total of five substrate binding sites were found in all proteins along with ERR triad and enzymatic active sites. The hinge motif denoted by the sequence (PPGP) and heme-iron binding domain (PFGVGRRSCPG) were also found in all the proteins (Fig. 6).

Fig. 6

Multiple sequence alignment showing the presence of conserved residues required for the enzymatic activity of C4H proteins.

In 4CL proteins, the two signature motifs, Box I (SSGTTGLPKGV) and Box II (GEICIRG) were found to be conserved in Vpl4CL proteins upon aligning with the 4CL proteins of A. thaliana and O. sativa. Additionally, the amino acid residues involved in substrate binding and enzymatic function were conserved in the majority of the proteins (Fig. 7).

Fig. 7

Sequence alignment of Vpl4CL proteins with At4CL and Os4CL proteins.

In silico prediction of physicochemical properties

The physicochemical properties of VplPAL, VplC4H and Vpl4CL were evaluated using various in silico tools (Table 2). The average length was found to be 717, 497 and 524 for VplPAL, VplC4H and Vpl4CL proteins, respectively. Molecular weight showed an average of 77.61 kDa for VplPAL proteins, 57.06 kDa for VplC4H proteins and 56.69 kDa for Vpl4CL proteins. The isoelectric point ranged from 5.57 in Vpl4CL4 to 9.18 in VplC4H2. The GRAVY value of both VplPAL and VplC4H proteins was negative. However, in the case of Vpl4CL proteins, all the proteins had positive GRAVY value except for Vpl4CL4. The instability index was below 40 for a majority of proteins except for VplPAL6, VplC4H1 and VplC4H2. Transmembrane helices were absent in all the proteins. Subcellular localization prediction revealed the localization of VplPAL, VplC4H and Vpl4CL proteins in the cytoplasm, endoplasmic reticulum and peroxisome, respectively.

Table 2 Physicochemical parameters of VplPAL, VplC4H and Vpl4CL proteins.

Secondary structure prediction

Analysis of the secondary structures of VplPAL, VplC4H and Vpl4CL proteins showed that alpha helices and random coils predominated in VplPAL and VplC4H proteins, followed by low percentages of extended strands and beta turns. However, in Vpl4CL proteins, the percentages of alpha helix and random coil were almost equal. Additionally, the proportion of extended strands was higher in Vpl4CL proteins than in the VplPAL and VplC4H proteins (Table 3; Fig. 8).

Table 3 Distribution of secondary structures in VplPAL, VplC4H and Vpl4CL proteins.
Fig. 8

Analysis of secondary structures showing the presence of alpha helices, extended strands, beta turns and random coils by blue, pink, red and green colour, respectively.

Phylogenetic analysis

The evolutionary analysis for PAL proteins showed that the PAL proteins belonging to gymnosperms, dicots and monocots formed three separate clades and orchids clustered together in the monocot clade (Fig. 9). In the case of C4H proteins, the VplC4H proteins clustered together with the class I members of C4H proteins of A. thaliana (AtC4H) and O. sativa (OsC4H1 and OsC4H4) (Fig. 10). Similarly, phylogenetic analysis in 4CL proteins depicted close clustering of Vpl4CL1 with class II members of 4CL proteins of A. thaliana (At4CL3) and O. sativa (Os4CL2) and Vpl4CL2-5 clustered with class III 4CL proteins of O. sativa (Os4CL1, 3, 4 and 5) and Panicum virgatum (Pv4CL1) (Fig. 11).

Fig. 9

Phylogenetic tree of PAL proteins showing the distinction between gymnosperms, monocots and dicots. PAL proteins belonging to different plants have been marked with different symbol annotations.

Fig. 10

Phylogenetic tree of C4H proteins showing the distinction between class I and class II. C4H proteins belonging to different plants have been marked with different symbol annotations.

Fig. 11

Phylogenetic tree of 4CL proteins showing the distinction between class I, class II and class III. 4CL proteins belonging to different plants have been marked with different symbol annotations.

Gene structure analysis

The exon-intron organization was similar in all the VplPAL genes with the presence of one intron in the biphasic phase except for the VplPAL6 gene which had two introns present. An identical arrangement of exons and introns was observed for VplC4H genes as both the genes had two introns present; one in the monophasic intronic phase and the other in the biphasic intronic phase. Interestingly, multiple introns were present in Vpl4CL genes. Vpl4CL1 showed the presence of six introns and seven exons while Vpl4CL2-5 genes had four introns (Supplementary Table S2, Fig. 12).

Fig. 12

Exon-intron arrangement in VplPAL, VplC4H and Vpl4CL genes.

Promoter analysis

On evaluating the promoter sequences of all the VplPAL, VplC4H and Vpl4CL genes, apart from the cis-acting elements such as CAAT and TATA boxes which are found commonly in all the genes, various other elements were also identified. These elements regulate four basic responses in plants; plant growth and development, phytohormone response, abiotic and biotic stress response and light response (Supplementary Table S3). Stress-responsive and light-responsive elements were found in more abundance in comparison to elements regulating phytohormone responses and plant growth and development. The cis-acting elements like ACI, ACII and O2-site identified in the present study played a critical role in plant growth and development. Elements that showed phytohormone responsiveness are ABRE, CGTCA-motif, ERE, P-box, TGACG-motif and TCA element. Some stress-responsive elements that were detected included ARE, DRE core, MYB, MYC, MYB-like sequence, WRE and WUN-motif. Along with them, some light responsive cis-acting elements such as AE Box, Box 4, G-Box, GA motif, GATA motif, GT1-motif, MRE and chs-CMA1a were also identified (Fig. 13).

Fig. 13

Distribution of cis-regulatory elements into different categories in VplPAL, VplC4H and Vpl4CL genes.

Gene expression analysis

The relative expression patterns of VplPAL, VplC4H and Vpl4CL genes were investigated in a variety of vegetative and reproductive tissues of Vanilla planifolia (Supplementary Table S4). VplPAL5 and VplPAL6 genes were expressed at elevated levels in the ovary while VplPAL2-4 and Vpl4CL2 showed high expression in fruit relative to the other tissues. Both VplC4H1 and VplC4H2 shared a similar expression profile by depicting high expression in the flower and relatively lower expression in the rest of the tissues under study. Lower expression of VplPAL, VplC4H and Vpl4CL genes was observed in the vegetative tissues compared to the reproductive tissues (Fig. 14).

Fig. 14

Expression analysis of VplPAL, VplC4H and Vpl4CL genes in various tissues of Vanilla planifolia.

Discussion

PALs, C4Hs and 4CLs are the key enzymes of the core phenylpropanoid pathway which contribute towards the synthesis of phenylpropanoids that act as precursor molecules for the production of a myriad of compounds that play a role in plant growth and development, defense against pathogens and response to environmental cues2,3. Although PALs, C4Hs, and 4CLs play crucial roles in plant metabolism, they are not characterized by a large number of members in most plant species. In silico characterization of genes has emerged as a crucial research technique in molecular biology to comprehend the different metabolic pathways and there are no earlier reports on the characterization of PAL, C4H and 4CL gene family in Vanilla planifolia. In the present research, six PAL genes have been identified in Vanilla planifolia similar to the identification of six PALs in Malus domestica10. Eucalyptus grandis9, Leucaena leucocephala24 and Salvia miltiorrhiza83 consist of two C4H genes which is consistent with the number of C4H genes characterized in Vanilla planifolia. Further, five 4CL genes have been predicted in Vanilla planifolia similar to the model plant Oryza sativa73 and another plant Populus tomentosa34 which also possessed five 4CL genes. Thus, a varied number of PALs, C4Hs and 4CLs have been reported amongst various plant species.

All the VplPAL and VplC4H proteins showed the presence of lyase_aromatic and p450 domain, respectively while in Vpl4CL proteins two domains, AMP-binding and AMP-binding_C were predicted. The presence of the conserved domains implied a structural similarity between all the members of a gene family. Further, motif analysis showed that the predicted motifs consisted of the catalytically essential and substrate binding amino acid residues and the conserved distribution and arrangement of PAL, C4H and 4CL specific motifs also pinpointed the highly conserved nature of PAL, C4H and 4CL gene families making them an important part of the phenylpropanoid pathway.

Multiple sequence alignment of VplPAL proteins along with PALs of Arabidopsis thaliana and Oryza sativa showed a high degree of sequence similarity amongst them and the existence of five conserved domains [N-terminal domain, MIO (4-methylidene-imidazolone-5-one) domain, core domain, shielding domain and C-terminal domain]. The Ala-Ser-Gly triad of the MIO domain, which is crucial for the enzymatic activity of PALs, was also present in all of the proteins112,113. As protein domains are distinct sequence regions that form discrete tertiary structures linked to specific functions such as catalysis or binding, identifying the conserved domains in a protein is indicative of its molecular or cellular role. All PALs shared the Phe-Leu residues that give the PAL enzyme its substrate specificity, indicating that they all accept phenylalanine as a substrate114. The conservation of other catalytically active residues (GLALVNG, NDN, and HNQD) points towards the catalytic activity and conserved nature of all the discovered VplPALs and is in line with the research on PALs in Dendrobium candidum115 and Vanda coerulea116. Similarly, VplC4H proteins showed the presence of all residues related to C4H activity, such as substrate binding sites, enzymatic active site, ERR triad, hinge motif and heme-iron binding domain, similar to Camellia sinensis28 and Saccharum spontaneum117. Further, sequence alignment of Vpl4CL proteins showed the conservation of the two signature motifs, Box I (SSGTTGLPKGV) and Box II (GEICIRG), which are present in 4CL proteins in all plants43,78,118,119,120,121,122. Box I is the AMP (adenosine monophosphate) nucleotide-binding motif and is conserved in all proteins belonging to the adenylate-forming enzyme family123,124. In addition, the cysteine in the GEICIRG motif has a role in the stability and catalytic activity of 4CLs125. Thus, the conserved nature of amino acid residues in all these proteins highlighted the consistency in the functionality of these proteins.

A bioinformatics approach was employed to characterize the physicochemical properties of the deduced proteins. The average molecular weight and isoelectric point range of PAL proteins were similar to findings from research on other plants like Salvia miltiorrhiza82, Citrullus lanatus8, Salix viminalis126, Cucumis sativus and Cucumis melo60. The amino acid length of the VplC4H2 protein (507 aa) was comparable to that of SmC4H1 (504 aa) of Salvia miltiorrhiza83,127, SsC4H4.1–1 A (505 aa) of Saccharum spontaneum117, CsC4Ha and CsC4Hb (505 aa) of Camellia sinensis28, and C4H8 of Fagopyrum tataricum63. The molecular weight and isoelectric point of SmC4H1 of Salvia miltiorrhiza83,127 were also comparable to the average molecular weight and isoelectric point of VplC4H proteins, respectively. Similarly, the protein length and molecular weight of Vpl4CL1 were similar to those of the 4CL of Neosinocalamus affinis128. The cytoplasmic localization of PAL proteins in the current study was consistent with earlier reports55,59,88,129,130. Similarly, VplC4H proteins were located in the endoplasmic reticulum, which is in line with GmC4Hs of Glycine max65, and the localization of Vpl4CL proteins in the peroxisome conforms to reports on many 4CL members of Gossypium hirsutum43 and Pg4CL10 of Punica granatum78. The proteins identified in Vanilla planifolia in the present study had no transmembrane regions; likewise, in other plants such as Salvia miltiorrhiza83, Boehmeria nivea35 and Fagopyrum tataricum63, transmembrane helices are absent in PAL, C4H and 4CL members. The majority of the discovered PAL, C4H and 4CL proteins had instability indices less than 40, indicating their stable nature. Further, the negative GRAVY values of specific proteins pointed towards their polar and hydrophilic character. PAL proteins of hydrophilic nature have been detected in Ornithogalum saundersiae131 and Cephalotaxus hainanensis56. However, the positive GRAVY value of all the Vpl4CL proteins except Vpl4CL4 is in conformity with previous reports of the hydrophobicity of Bn4CL3 of Boehmeria nivea35 and Pg4CL1-3, Pg4CL6 and Pg4CL8-11 of Punica granatum78. Thus, similarities in the physical parameters of the deduced PALs, C4Hs and 4CLs in the present study to the already identified members in other plant species corroborated the conserved nature of these proteins.

Secondary structure prediction showed that alpha helices and random coils constituted a major proportion in VplPAL and VplC4H proteins, hinting towards the importance of these secondary structure elements for structural stability and catalytic function. These results were also comparable with PAL and C4H proteins identified in other plant species14,27,56,72,130,132.

According to the phylogenetic analysis, the VplC4H proteins were predicted as putative class I members as they clustered along with the class I members of A. thaliana and O. sativa, thus suggesting their role in lignification in plants21. In the case of dicots, the 4CL proteins are classified into class I and class II wherein the class I proteins are linked to lignin accumulation and class II proteins play a role in the metabolism of other phenolic compounds29. However, in the case of monocots like Oryza sativa, a new phylogenetic category, class III is also present. It is speculated that this divergent evolution may be due to the variation in the phenolic compound composition in monocots and dicots and different substrate specificity of 4CL enzymes among these two groups of plants73. In the present study, the majority of the 4CL proteins of Vanilla planifolia (a monocotyledonous species) clustered in the class III clade. In addition, close evolutionary ties between the proteins were depicted by their close clustering in the phylogenetic tree.

The gene structure analysis showed the presence of one intron in almost all VplPAL genes, which is in line with previous studies on PAL genes in other plants like Oryza sativa and Carya illinoinensis55,61. Similarly, the results of gene structure analysis for VplC4H genes and Vpl4CL genes were in conformity with C4H genes of Brassica napus132 and Salvia miltiorrhiza83 and 4CL genes of Physcomitrella patens76, respectively. Additionally, all the Vpl4CL genes shared a similar structural pattern except Vpl4CL1, which clustered with the class II 4CLs in the phylogenetic analysis. Thus, the degree of similarity was higher among genes belonging to the same class in the phylogenetic tree. Moreover, the similar organization of gene structure within a gene family points towards its plausible conservation during plant evolution.

The presence of cis-regulatory elements typical of the VplPAL, VplC4H and Vpl4CL genes involved in the phenylpropanoid pathway was elucidated through promoter analysis. The analysis of the promoter regions revealed that these genes consisted of numerous phytohormone-responsive, abiotic and biotic stress-responsive, plant growth and development inducers and light-responsive cis-elements. In Punica granatum 4CL genes78 and Glycine max C4H genes133, similar cis-acting elements belonging to the aforementioned four categories were predicted. The presence of AC-II element in VplPAL1 and Vpl4CL3 is similar to the promoter of Na4CL of Neosinocalamus affinis128. The cis-acting elements found in the promoter sequences of VplPAL, VplC4H and Vpl4CL genes were also similar to those found in Salvia miltiorrhiza 4CL genes40. The presence of cis-acting elements regulating stress and phytohormone responses was also reported in 4CL genes of Gossypium hirsutum43 and Eucommia ulmoides44 and PAL genes of Salvia miltiorrhiza82 and Triticum aestivum88. In addition, specifically MeJA responsive elements were found in C4H members of Salvia miltiorrhiza83 and Saccharum spontaneum117 and PAL members of Carya illinoinensis55 similar to many PAL, C4H and 4CL members in our study. Apart from MeJA, cis-elements induced by gibberellins, abscisic acid and salicylic acid were also predicted during the present analysis which is in line with C4H genes in Saccharum spontaneum117. Thus, it is foreseeable that phytohormones may regulate the expression of PAL, C4H and 4CL genes.

The spatial variation in the expression of PAL, C4H and 4CL genes influences the spatial diversity of phenylpropanoids in plants. Hence, the expression profiles of VplPAL, VplC4H and Vpl4CL genes in different plant tissues of Vanilla planifolia were analysed. The expression profile of VplPAL genes was found to diverge among the various vegetative and reproductive tissues. The reduced expression of ChPAL genes of Cephalotaxus hainanensis56, DcPAL of Dendrobium candidum115 and DcPAL1 of Dendrobium catenatum134 in leaf tissue was in accordance with the expression of VplPALs. In addition, VplPAL2-4 were strongly expressed in fruit, which is corroborated by the high expression of the PbPAL2 gene of Pyrus bretschneideri135 in the same tissue. This suggests that these genes may play a role in fruit development. The expression analysis of C4H genes of Vanilla planifolia showed that both VplC4H1 and VplC4H2 were expressed at elevated levels in the flower tissue. Likewise, the SmC4H1 gene of Salvia miltiorrhiza83 and FtC4H6 of Fagopyrum tataricum63 were highly expressed in flowers, which might indicate a role for these genes in flowering. However, both VplC4H genes showed very low expression in the soil root, which is in line with the expression of BoC4H.2 and BoC4H.4 of Brassica oleracea136. Similarly, the overlapping and divergent expression patterns of Vpl4CL genes in different tissues were also analysed. High levels of expression were observed for Vpl4CL2 in fruit, and likewise Ri4CL2-3 of Rubus idaeus38 showed enhanced expression in fruit. Thus, the results point towards functional divergence within the PAL, C4H and 4CL gene families in Vanilla planifolia.

Thus, further study of the regulation of the enzymes involved in these pathways can provide insights into plant metabolic pathways and their adaptation to various environmental conditions. Furthermore, understanding the mechanisms underlying the function of PALs, C4Hs, and 4CLs may also enable targeted genetic manipulations to enhance the accumulation of desired phenylpropanoid compounds in orchid plants.

Conclusions

An in-silico genome-wide characterization of the PAL, C4H and 4CL gene families regulating the phenylpropanoid pathway in Vanilla planifolia was carried out, and a total of six PAL, two C4H and five 4CL genes were identified. Domain analysis, multiple sequence alignment, conserved motif prediction, phylogenetic analysis, secondary structure prediction and gene structure analysis substantiated the highly conserved nature of these three gene families. Cis-regulatory element prediction and expression profiling highlighted the role of these genes in plant growth and development. This study lays a basic framework for the functional characterization of PAL, C4H and 4CL genes in orchids, which would further help in understanding how the expression of individual genes correlates with the biosynthesis of phenylpropanoids in orchids.