Introduction

Piper sarmentosum Roxb., a perennial creeping herb in the Piperaceae family, is primarily distributed along the southeastern coast of China and in tropical regions of Southeast Asia1. As a dual-purpose medicinal and edible plant, P. sarmentosum has a long history of use in traditional Chinese medicine. Its entire plant can be used as medicine and is known for its functions of dispelling wind and cold, promoting blood circulation, and reducing swelling. It is commonly used to treat conditions such as rheumatic pain and trauma-related swelling2. Modern research indicates that P. sarmentosum extract (PSE) possesses a variety of biological activities, such as antioxidant, analgesic, antimicrobial, and anti-inflammatory effects. These activities are likely related to its rich phytochemical composition, which includes alkaloids, amides, and phenolic compounds3,4,5,6. Among these, amide alkaloids are one of its most abundant chemical constituents7, showing potential for development as novel anticancer drugs8,9,10.

The anticancer activity of Piper species is closely related to the presence of various bioactive compounds, particularly amides, such as pipelamide11,12. Among these compounds, piperlongumine (PL) was first isolated from Piper longum, and subsequent studies have found that P. sarmentosum also contains this important compound13. The α,β-unsaturated carbonyl group in the molecular structure of PL is the key pharmacophore responsible for its pharmacological activity, making it an ideal candidate molecule for the development of multitarget drugs13. Studies have shown that PL selectively induces apoptosis in tumor cells while maintaining the activity of normal cells, which gives it a unique advantage in cancer treatment. Its mechanisms include: inducing apoptosis in tumor cells by increasing reactive oxygen species (ROS) levels; inhibiting tumor metastasis-related signaling pathways; and regulating the expression of proteins associated with various cancers, such as breast cancer, colon cancer, prostate cancer, and lung cancer14,15,16. In addition to its anticancer activity, PL also demonstrates significant anti-inflammatory, antidiabetic, antidepressant, neuroprotective, and cardiovascular protective effects, making it an ideal candidate for multitarget drug development17,18. Currently, the chemical synthesis of PL can be achieved through reactions involving 3,4,5-trimethoxy-cinnamic acid and 5,6-dihydro-2(1H)-pyridinone with 1,3-dicyclohexylcarbodiimide (DCC) as a coupling agent, or by using Grubbs I and Grubbs II catalysts. Several new synthetic methods have been developed, although the overall reaction steps remain relatively complex and the conditions harsh19,20,21,22,23,24,25. However, with the development of biotechnology, biosynthesis has gradually become a simpler and more efficient synthesis route due to its highly selective catalysis26. Previous studies have found that the content of PL in P. longum L. fruits is only 1.2 mg/g, while P. sarmentosum contains approximately 40% more. The elucidation and analysis of the biosynthetic genes and mechanisms of complex natural products occur through the key steps of biosynthesis27. The key biosynthetic enzyme gene of PL has not yet been identified, which limits the possibility of achieving its large-scale production through synthetic biology methods28.

On the other hand, the Piperaceae family is a large plant family with over 3600 species, among which the Piper genus and the Peperomia genus are the most representative29. To date, only P. nigrum in the Piperaceae family has completed chromosome-level genome sequencing and assembly30. Although the genome of P. longum, a species morphologically similar to P. sarmentosum, has preliminary genomic information, it is still at the contig level, containing 89,204 scaffolds with an N50 value of 10.7 kb, which presents significant fragmentation issues31. This limits its application in gene function research and the analysis of complex gene structures. In this study, we combined PacBio HIFI and high-throughput chromosome conformation capture (Hi-C) technologies to successfully obtain a high-quality genome of P. sarmentosum. Through comparative genomics analysis, we revealed a whole-genome duplication (WGD) event in the Piperaceae plant genome, and by integrating transcriptomic and metabolomic data, we identified and validated a series of key genes in the biosynthetic pathway of PL. Our results provide an important foundation for understanding the genomic evolution of Piperaceae plants as well as alkaloid research.

Results

Genome sequencing, assembly, and quality assessment

To estimate the initial characteristics of the P. sarmentosum genome, we generated 18.35 Gb of high-precision HiFi data with an average read length of 22.61 kb using the PacBio platform. K-mer analysis (K = 21) was performed to evaluate the genome size, repetitive sequences, and the heterozygosity rate of the genome. The survey plot revealed that this plant has a highly heterozygous and repetitive genome, with an estimated genome size of approximately 526.56 Mb and a heterozygosity rate of about 0.81% (Fig. 1A). Subsequently, these HiFi reads were assembled into a 548.44 Mb draft genome using Hifiasm, consisting of 674 contigs (N50 = 15.36 Mb). After several rounds of assembly, polishing, comparison, extension, and gap filling, we obtained 137 scaffolds, which included 102 gaps. The N50 of the scaffolds is 22.52 Mb, and the longest scaffold is 38.88 Mb. We constructed a high-throughput Hi-C library for P. sarmentosum, generating 85 Gb of Hi-C paired-end reads. The allelic sequences of 127 scaffolds were anchored to 26 chromosomes, achieving an anchoring rate of 97.79%, with an N50 of 34 Mb and an L50 of 10 Mb. Chromatin interaction data indicate that our Hi-C assembly is of high quality (Fig. 1B). To assess the integrity of the genome assembly, we further investigated the telomere and centromere structures. Except for chromosome 21, where telomeres were not detected, and chromosomes 12, 16, and 20, where only one side of the telomere was detected, all other chromosomes showed detectable telomeres with the telomere repeat monomer being AAACCCT/TTTGGGA. Centromere regions, ranging from 7.57 Mb to 130 kb, were predicted, and these regions were found to be associated with repetitive sequences and negatively correlated with gene distribution, further demonstrating the integrity of the genome assembly (Fig. 1C). We determined that the BUSCO score for the P. sarmentosum genome was 97.4% (Table 1), higher than the published value for P. nigrum (BUSCO: 96.1%) and P. longum (BUSCO: 96.0%). Additional validation using the LTR Assembly Index (LAI = 12.68) to assess the integrity of long terminal repeats (LTRs) further confirmed that the assembled P. sarmentosum genome is of sufficiently high quality to serve as a reference genome (Table 1 and Supplementary Data 1).

Fig. 1: Genome Architecture and Characteristics of P. sarmentosum.
figure 1

A K-mer distribution (21-mer) spectrum, indicative of the genome complexity and size of P. sarmentosum, as determined by k-mer analysis. B Hi-C chromatin interaction heatmap for the 55 pseudomolecules of the P. sarmentosum genome, revealing chromatin interaction regions that may play a crucial role in gene expression regulation. C The genomic landscape of P. sarmentosum. The features, from outside to inside, are: pseudochromosomes (1), GC content (0%–50%). (2), gene density (0–1). (3), TE repeat density (0–1). (4), LTR repeat density (0–1). (5, 6, 7, 8, 9), root, stem, leaf, flower, and fruit Transcripts Per Kilobase of exonmodel per Million mapped reads (TPM). (line). Intra-genome collinear blocks with > 10 gene pairs are highlighted with arcs in the middle of the diagram. Different colored lines connect matched gene pairs between different chromosomes. Circos was used to construct the diagram. All distributions were drawn using a window size of 1 Mb, except for expression data, which was drawn using a window size of 50 kb. “Chr” refers to chromosome. D The insertion time distribution of LTR elements in the genome of giant ragweed. MYA: indicates million years ago. E Phylogenetic relationship of Ty1-Copia LTR-RTs. F Phylogenetic relationship of Ty3-Gypsy LTR-RTs.

Table 1 Statistics of the genome assembly for P. sarmentosum

Genome Annotation

De novo prediction revealed that the P. sarmentosum genome contains 326.88 Mb of repetitive sequences, accounting for approximately 57.88% of the total genome size (Supplementary Data 2). These repeats consist of 31.82% retrotransposons, 4.10% DNA transposons, 18.78% unclassified repeats, and 1.12% satellite repeats. Among them, LTR retrotransposons represent the most abundant class of repetitive DNA, with Copia and Gypsy elements comprising 5.13% and 25.69% of the genome, respectively. These repetitive elements are unevenly distributed across chromosomes, showing a strong preference for centromeric regions (Fig. 1C). The insertion time distribution of Long Terminal Repeat Retrotransposons (LTR-RTs) reflects the historical activity of transposable elements (TEs) and selective pressures32. Active LTR-RTs inserted within or near functional genes can induce phenotypic variation, and the amplification and contraction of LTR-RTs significantly affect genome size33,34. In our analysis of the P. sarmentosum and P. nigrum genomes, we observed significant differences in the insertion times and durations of the Copia and Gypsy superfamilies of LTR-RTs (Fig. 1D). For the Gypsy family, the insertion time in P. sarmentosum was estimated to be 34.45 MYA, whereas in P. nigrum, it was significantly more recent at 15.90 MYAs, with P. nigrum containing more insertions than P. sarmentosum. This recent expansion of Gypsy elements in P. nigrum suggests that, following the divergence of the two species, Gypsy elements underwent a sustained period of growth. As for the Copia family, the insertion time in P. nigrum was recorded as 57.72 MYA, whereas in P. sarmentosum, insertion events continued intermittently between 1.0 and 2.67 MYA. Further phylogenetic analysis, based on coding region sequence homology, allowed us to classify the Copia and Gypsy elements into distinct lineages. Among these, the most abundant members of the Ty1/copia and Ty3/gypsy families were identified as Tork and Tekay, respectively (Fig. 1E, F). Functional prediction analysis of genes containing specific LTR retrotransposons inserted within approximately 5000 base pairs revealed that these insertions could impact gene functions in several ways. These include transposition, RNA-mediated processes, N-acylethanolamine metabolic processes, DNA integration, regulation of ubiquitin-protein transferase activity, cellular glucose homeostasis, terpenoid biosynthesis, and defensive response regulation (Supplementary Data 3). The rapid amplification of repeat elements likely played a crucial role in the genomic evolution of P. sarmentosum and its divergence from closely related species. This suggests that LTR-RTs, particularly Gypsy and Copia elements, have been important contributors to the genetic diversification and adaptation of P. sarmentosum, potentially influencing its phenotypic traits and evolutionary history.

Using a combined approach of de novo prediction, homology-based annotation, and transcriptomic evidence, we annotated a total of 39,154 protein-coding genes, with an average gene length of 3312 bp and an average coding sequence (CDS) length of 1151 bp. These genes exhibit an uneven chromosomal distribution, with a pronounced bias toward telomeric regions, and display distinct expression patterns across different tissues (Fig. 1C). Although the number of annotated genes is significantly lower than that reported for the P. nigrum genome, our annotation achieves a completeness score of 98.83%, substantially higher than the previously reported 83.6% for P. nigrum (Supplementary Data 1). Functional annotations were assigned to the predicted genes using five major databases: TrEMBL, SWISS-PROT, Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and InterPro. In total, 20,937 (86.24%), 16,603 (38.03%), 31,276 (44.55%), 30,171 (94.14%), 35,034 (73.29%), and 35,232 (84.48%) gene models were successfully annotated in GO, KEGG, COG, PFAM, Swiss-Prot, and NR, respectively (Supplementary Data 4). Over 90% (35,266) of the genes were functionally annotated. Furthermore, we classified 2403 transcription factors (TFs) from 58 gene families, representing 6.14% of the total CDS. Among these, basic helix-loop-helix (bHLH), ethylene-responsive element-binding factors (ERF), and myeloblastosis (MYB) families were the most prominent. Additionally, we identified 646 chromatin regulators (CRs), 157 transcriptional regulators (TRs), and a variety of non-coding RNAs (ncRNAs), including 1138 small RNAs, 175 miRNAs, 652 tRNAs, 1081 rRNAs, and 846 snRNAs.

Gene Family Evolution and Phylogenetic Reconstruction and WGD

We conducted gene family analysis by comparing the protein-coding sequences of P. sarmentosum with those of 14 other sequenced plant genomes. A total of 503,044 genes from the 15 species were subjected to gene family clustering, resulting in the identification of 31,805 orthologous gene families. Among these, 273 gene families were single-copy genes, while 5853 gene families were shared by all species (Fig. 2A and Supplementary Data 5). Among the 15 species, P. nigrum had the highest number of single-copy orthologs (5383), while P. sarmentosum had the fewest (1668). The highest number of multicopy orthologs was identified in P. nigrum, with 18,483 multicopy orthologs. Furthermore, the gene family clustering analysis revealed that 8429 gene families were shared by all five Magnoliids, while 468 gene families were unique to the P. sarmentosum genome, far fewer than the 3088 gene families in P. nigrum; of these, 3838 orthologs are shared within the Piper genus (Fig. 2B and Supplementary Data 5). To investigate the functions of the shared genes within the Piper genus, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on the 3838 shared gene families. The GO term enrichment analysis showed that these genes are mainly involved in “binding,” “transport,” “metabolic processes,” and “signal transduction” (Supplementary Data 6). In contrast, KEGG enrichment analysis indicated that most of these genes were associated with pathways related to the biosynthesis of secondary metabolites, such as “Stilbenoid, diarylheptanoid and gingerol biosynthesis,” “Phenylpropanoid biosynthesis,” and “Terpenoid biosynthesis” (Supplementary Data 6). Additionally, we performed GO functional annotation for 382 unique genes, which revealed that these genes are enriched in biological processes such as “DNA unwinding involved in DNA replication,” “regulation of helicase activity,” “mannose transmembrane transport,” “triterpenoid biosynthetic process,” “regulation of cellular response to heat,” and “membrane protein ectodomain proteolysis” (Supplementary Data 7).

Fig. 2: Comparative Genomics and Evolutionary Dynamics of Gene Families.
figure 2

A Copy number distribution of gene families in 15 species. B Comparative analysis of orthologous gene families among five species. C Phylogenetic analysis and assessment of gene family expansion and contraction based on whole-genome data. D GO enrichment analysis characterizing expanded gene families in the P. sarmentosum genome, associated with 20 different functions. E Ks density distributions. F Interspecific collinearity plot.

To investigate the phylogeny and evolutionary trajectory of P. sarmentosum, a genomic-scale phylogenetic tree was constructed based on 273 strict single-copy orthologous genes from 15 plant genomes, with Amborella trichopoda as the outgroup. The maximum likelihood (ML) method was employed for tree construction (Fig. 2C and Supplementary Data 8). As expected, the phylogenetic tree places P. sarmentosum as the sister group to the Piper species. The divergence of the Piperaceae family is estimated to have occurred around 92.20 MYA, and the divergence between P. nigrum and P. sarmentosum is estimated to have occurred approximately 17.56 MYA. Further analysis using CAFE 5 revealed that 4430 gene families underwent expansion in the P. nigrum lineage, while 447 gene families contracted (Fig. 2C). In P. nigrum, 2967 gene families expanded, while P. sarmentosum exhibited 2746 expanded gene families. Among these, the Piper species showed contraction in 1406 gene families, while P. sarmentosum had contraction in 813 gene families. Gene Ontology (GO) functional enrichment analysis of all expanded gene families revealed that they are significantly involved in processes such as alkaloid metabolic process (GO:0009820), cellular amide metabolic process (GO:0043603), jasmonic acid metabolic process (GO:0009694), response to wounding (GO:0009611), and mismatch repair (GO:0006298) (Supplementary Data 9). In addition, the specific gene expansions in P. sarmentosum were notably enriched in genes related to secondary metabolism, disease resistance, and growth/development (Fig. 2D). These include BAHD acyltransferases (28 genes), sulfotransferases, disease resistance proteins (e.g., RPS5, RPP13, RGA1, RGA2, RGA3, RGA4, RGA5), NBS-LRR disease resistance proteins, LRR receptor-like serine/threonine protein kinases (e.g., EFR, BAM1, FLS2, GSO2, EMS1), jasmonate (28 genes), and auxin-inducible genes (65 genes). These gene expansions may indicate that biotic stress from pathogen infection and herbivory are the primary selective pressures acting on P. sarmentosum in tropical rainforests.

Whole-genome duplication is considered a major evolutionary force that generates additional genetic material, enhancing the adaptability and plasticity of plants, which in turn promotes species diversification and functional innovation35. In this study, we used MCScanX to infer synteny within the genome of P. sarmentosum. Our analysis identified 739 syntenic blocks across the entire genome, containing 25,308 genes, which account for 65.76% of the total gene count. Among these syntenic blocks, 689 (93.33%) of the paralogous genes were located between chromosomes, while the remaining 50 (6.77%) were located within chromosomes (Fig. 1C). The analysis of paralogous gene duplication types in P. sarmentosum indicated that most genes were classified as resulting from WGD or segmental duplications (25,308 genes, 65.76%). This suggests that a large proportion of P. sarmentosum’s genes experienced duplication during the evolutionary process. To gain deeper insight into the evolutionary history of the P. sarmentosum genome, we compared its synonymous substitution rate (Ks) with the genomes of P. nigrum, Cinnamomum camphora, Liriodendron tulipifera, and grape to identify homologous gene pairs and assess the synteny relationships between these species. The Ks curve rates of syntenic gene pairs indicate that the P. sarmentosum genome has undergone three rounds of WGD: the first being an ancient triplication event (Ks ≈ 1.53), consistent with the γ event of the angiosperm ancestor, shared by all core eudicot lineages. The second was a β duplication event shared by magnoliids (Ks ≈ 0.88), consistent with the Ks distribution region of other magnoliids, followed by a recent triplication event (Ks ≈ 0.15). Additionally, we conducted a synteny analysis using Aristolochia fimbriata, P. sarmentosum and P. nigrum, revealing a ratio of 1:8:8, consistent with previous studies on P. nigrum, further confirming the three rounds of WGD in P. sarmentosum36.

Metabolite Profiling and Biosynthesis Pathways of Piperine and PL in P. sarmentosum

As a distinctive plant, P. sarmentosum has not been extensively studied in terms of its metabolomics. This study uses non-targeted metabolomics to investigate the distribution of metabolites across different tissues of P. sarmentosum.A total of 4456 metabolites were identified, including 1290 amino acids and derivatives, 535 phenolic compounds, 535 organic acids, 238 alkaloids, 182 phenolic acids, 161 flavonoids, and other types of metabolites (Fig. 3A and Supplementary Data 10). The distribution of metabolites shows distinct tissue specificity (Fig. 3B). To assess the tissue-specific metabolites in P. sarmentosum, we conducted a weighted gene co-expression network analysis (WGCNA), which identified 15 distinct metabolic modules. Among these, the Brown, Turquoise, Blue, Yellow, and Green modules were found to be specific to each tissue, collectively representing 84.56% of the total metabolites (Fig. 3C). KEGG pathway enrichment analysis of these core modules revealed significant enrichment in the “Valine, leucine, and isoleucine biosynthesis” and “Glycolysis / Gluconeogenesis” pathways across all modules (Fig. 3D and Supplementary Data 11). Notably, the Brown and Turquoise modules showed significant enrichment in the “Biosynthesis of various alkaloids” pathway. We further investigated and observed that the relative content of piperidine alkaloid metabolites, such as Piperine, Piperitone, Guineensine, Pyrrolidine, Trichostachine, Piperlonguminine, and PL, is notably higher in the root, flower, and fruit tissues (Fig. 3E). This differential distribution may reflect a tissue-specific role of these alkaloids in plant defense mechanisms, as alkaloids are known to act as deterrents against herbivores and pathogens. (Fig. 3E and Supplementary Data 10).

Fig. 3: Untargeted metabolomics analysis of P. sarmentosum.
figure 3

A Heatmap analysis of metabolites across different tissues. B Principal component analysis (PCA) of various samples. C WGCNA module analysis of all metabolomic data. D KEGG enrichment analysis of metabolites from five specific modules. E Relative content of piperidine alkaloid metabolites in different tissues. MJ represents stem, ML represents leaf, MR represents root, MF1 represents flower, and MF2 represents fruit.

P. sarmentosum contains piperine, which is not only the major amid alkaloid in the fruit but also the key source of its spicy flavor and biological activity. However, the piperine content in P. sarmentosum is much lower than in P. nigrum37. Therefore, understanding its biosynthetic pathway is crucial for developing crops with high piperine content. Current research on piperine biosynthesis mainly involves three gene groups: The first group includes genes in the phenylpropanoid pathway (map00940), which catalyze a series of complex reactions to produce Feruperic acid, a precursor for the biosynthesis of piperoyl-CoA38. The second group consists of genes involved in L-lysine metabolism (map01064), which catalyze the conversion of lysine into piperidine through a series of reactions, including decarboxylation, amine oxidation, cyclization, and reduction38,39,40. The third group is acyltransferase genes, which catalyze the synthesis of piperidine alkaloids in the presence of piperoyl-CoA and piperidine (Fig. 4A). However, the biosynthetic pathway of PL remains unclear. Based on its chemical structure, we hypothesize that the key steps in its synthesis involve the reaction between Sinapoyl-CoA and 5,6-dihydropyridin-2(1H)-one, catalyzed by the BADH enzyme, forming an amide compound, which is then methylated by the OMT enzyme (Fig. 4A). Notably, this step shares many genes with the biosynthesis of piperine (Fig. 4A, B). Through homology searches and functional annotations, we identified 110 candidate genes that are likely involved in the biosynthesis of piperine alkaloids. Expression analysis showed distinct expression patterns of these genes across different plant tissues (Supplementary Data 12). Most of the genes in the phenylpropanoid pathway exhibit high expression levels, including PsSK (Pbechr23G000696, Pbechr8G000852), PsEPSPase (Pbechr5G001181), PsCS (Pbechr10G000850, Pbechr8G000653), PsCM (Pbechr22G000529, Pbechr26G000427, Pbechr8G001394), PsPAL (Pbechr4G002064, Pbechr5G000287, Pbechr3G001915), PsC4H (Pbechr16G000464, Pbechr23G000605), PsCOMT (Pbechr1G002005), Ps4CL (Pbechr2G001132, Pbechr8G001369), PsFSD1 (Pbechr7G001890) and PsCAO (Pbechr18G000248, Pbechr3G001234, Pbechr4G000818). These genes are highly expressed in multiple tissues, with the root, flower, and fruit showing the highest levels (Fig. 4A, B).

Fig. 4: Biosynthetic Pathways of Piperine and PL, and BADH Gene Family Analysis.
figure 4

A Biosynthetic Pathway of Piperine. B Predicted Biosynthetic Pathway of PL. C Phylogenetic tree of the BADH gene family. The red star indicates the Piperamide synthases from P. sarmentosum. D Expression heatmap (row-scaled) of PsBADH genes across different organs based on transcriptome data. Note: The heatmaps show TPM-based expression levels of genes in five plant tissues: leaf (L), stem (S), root (R), flower (F1), and fruit (F2). Data were log₁₀-transformed, with each row representing a gene and each column a tissue. For certain biosynthetic steps, multiple candidate genes or homologs were identified, resulting in multiple rows. Gene-to-step assignments are provided in Supplementary Data 12.

Acyltransferase (BADH) is a key enzyme in the biosynthesis of piperine-like compounds. In this study, we conducted a gene family analysis and identified 126 BADH genes in P. sarmentosum, 155 in P. nigrum, 51 in A. fimbriata species, and 63 in Arabidopsis thaliana. These genes include two characteristic motifs found in all BAHD-like enzymes: HXXXD and DFGWG. A total of 126 PsBAHD genes were distributed across 25 chromosomes, with 51 of these genes (40.48%) located in WGD gene clusters, suggesting that the WGD event has played an important role in the amplification of PsBAHD genes (Supplementary Data 12). Phylogenetic trees were constructed using the Maximum Likelihood (ML) method, and based on the reference by Liu et al., the BAHD family proteins were classified into six branches (I, II, III, IV, V, and VI) (Fig. 4C)41. Notably, previous transcriptome studies have identified two genes involved in piperine synthesis: Piperine synthases (MW354956), which specifically catalyze the synthesis of piperine using piperoyl-CoA and piperidine as substrates, and Piperamide synthases (MW354957), which have broader substrate specificity and catalyze reactions involving various CoA esters and aliphatic or aromatic amines, producing a range of piperamide derivatives39. We found that in P. sarmentosum, only Piperamide synthases (Pbechr3G000899) are present, while Piperine synthases are absent. Furthermore, this gene is predominantly expressed in the root, which could explain the lower content of piperine in P. sarmentosum compared to P. nigrum. The remaining PsBAHD genes are mainly expressed at higher levels in the flowers and fruits of P. sarmentosum, suggesting that these genes may play a crucial role in the biosynthesis of piperamide and other alkaloids in the flowers and fruits (Fig. 4D and Supplementary Data 12). Biosynthetic genes for PL still require further analysis.

Identification of key genes involved in PL biosynthesis

To investigate the genes and regulatory network involved in the synthesis of PL, we constructed a WGCNA network using 39,154 genes from 15 samples. Genes with low expression variation (standard deviation ≤ 0.8) were filtered out, leaving 19,362 genes, which were then divided into 16 distinct modules, each marked with a different color. The modules are presented in a cluster dendrogram, network heatmap, and trait heatmap, with the gray module representing genes that were not assigned to any specific module and have no reference significance (Fig. 5A). Through Pearson correlation analysis, we associated each co-expressed module with alkaloid and lysine pathway metabolites. The results showed that most alkaloid-related modules were negatively correlated with L-Lysine and 5-Aminopentamic acid. Specifically, the ‘Orange’ and ‘Darkturquoise’ modules showed significant correlations with L-Lysine and 5-Aminopentamic acid ImC, while the ‘Darkolivegreen’ module was significantly associated with Pipericine and Piperine. Notably, PL was significantly correlated with the ‘Darkgreen’ module, which showed high expression during the flowering and fruiting stages. Therefore, we hypothesize that genes within the Darkgreen module may be closely involved in the synthesis of PL (Fig. 5B). Further analysis revealed that the Darkgreen module contained 621 genes, and their expression levels were significantly upregulated in the flowers and fruits (Fig. 5B). KEGG functional analysis indicated that most of these genes were enriched in metabolic pathways such as Biosynthesis of secondary metabolites, Metabolic pathways, Terpenoid backbone biosynthesis, Fatty acid degradation, Plant hormone signal transduction, Nicotinate and nicotinamide metabolism, Monoterpenoid biosynthesis, and MAPK signaling pathway – plant, among others. GO functional analysis showed that these genes are involved in flower development and the biosynthesis of alkaloid metabolites (Supplementary Data 13). To identify key genes involved in the synthesis of PL, we selected the top 200 genes with the highest ImC values from the Darkgreen module and visualized the relationships between highly connected genes using Cytoscape software (Fig. 5C). Among these genes with the highest connection degrees (KEM), we identified several key genes, such as PsHCT1 (Shikimate hydroxycinnamoyl transferase, Pbechr25G000561), PsCYP450, PsAO, PsPK, PsCER3, and others. In addition, several transcription factors, including MYB, bHLH, and CBF, were also identified. These transcription factors may play an important role in maintaining the development of flower and fruit organs, as well as in regulating the synthesis of amide alkaloids (Supplementary Data 14).

Fig. 5: WGCNA analysis of piperine-type metabolites content-associated genes across tissues and identification of key biosynthetic genes.
figure 5

A Hierarchical clustering dendrogram of 16 co-expression modules. B Relationship between modules and the content of core metabolites. Each row and column represents a module and a metabolite, respectively. Left panel: 16 color-coded modules; right panel: correlation heatmap (blue-to-red scale: -1 to 1); The number in parentheses represents the p-value. C Identification of top 200 hub genes in the blue co-expression module. D The subcellular localization of PsHCT1 in tobacco leaves, including images of the empty vector control and PsHCT1-GFP under GFP-field. Co-expression images of the plasma membrane marker (PIP2A-RFP) and PsHCT1-GFP. Merged images of bright-field, GFP-field, and RFP-field. Scale bar, 20 μm. E Molecular docking of PsHCT1 with sinapoyl-CoA and predicted reaction product. F Molecular docking of PsCCoAOMT 1 with PL precursor and PL.

Notably, PsHCT1 is the most highly connected hub gene within this Darkgreen module, and it has the highest expression level among all the BADH gene family members. The PsHCT1-like gene encodes a protein of 427 amino acids, with an encoding sequence of 1281 base pairs. Subcellular localization analysis revealed that PsHCT1 is localized in the cytoplasm, consistent with previous studies42,43 (Fig. 5D). To further explore the function of this gene, we modeled PsHCT1 based on the best template, the X-ray crystal structure of BAHD from rosemary (PDB accession number: 6MK2). We further investigated the interactions between the compounds Sinapic acid-CoA and 5,6-Dihydropyridin-2(1H)-one and the PsHCT1 gene, using the HCT crystal structure as the receptor. As shown in Fig. 5E, Sinapic acid forms hydrogen bonds with the residues Asp-298 (6.5 Å), SER-378 (3.4 Å), LEU-35 (2.5 Å), His-158 (1.5 Å), and Asp-162 (3.41 Å), and interacts with several other amino acid residues in this range. On the other hand, 5,6-Dihydropyridin-2(1H)-one forms a hydrogen bond with PRO-37 (1.1 Å). Notably, 5,6-Dihydropyridin-2(1H)-one forms a 2.9 Å hydrogen bond with the acyl group of Sinapic acid (Fig. 5E). The molecular docking results, which align with structure-activity relationships, show a strong binding affinity with an affinity value of -6.2 kcal/mol. Additionally, we predicted molecular docking between the PsHCT1gene and the product. The results indicated that the product forms hydrogen bonds with ARG-363 (8.7 and 6.5 Å), LYS-221 (5.5, 5.5 Å), and PRO-37 (6.1 Å), demonstrating a high binding force. The molecular docking results, which are consistent with structure-activity relationships, also show an affinity of -6.3 kcal/mol, indicating a potential binding interaction (Fig. 5E).

In the biosynthesis of PL, OMT enzymes play a crucial role in the addition of methoxy groups. Subsequently, through gene family analysis, we identified 96 OMT genes. A phylogenetic tree was constructed using Arabidopsis as a reference, revealing that the tree is divided into two families: PsCCoAOMT (35 genes) and PsCOMT (61 genes) (Supplementary Data 15). Based on the chemical structure of PL, we hypothesize that its biosynthesis is closely related to the PsCCoAOMT subfamily, which exhibits highly active gene expression, with PsCCoAOMT1 (Pbechr20G000401) showing the highest expression level (Supplementary Data 15). We then performed molecular docking predictions between the PsCCoAOMT1 gene, PL precursors, and the PL molecule. The results showed binding energies of -7.7 kcal/mol and -8.0 kcal/mol, indicating strong binding affinity (Fig. 5F). This suggests that the enzyme may catalyze the addition of the methoxy group to the PL precursor through stable interactions. To summarize, through WGCNA, correlation analysis, high expression levels, and strong binding predictions from molecular docking, we identified PsHCT1 and PsCCoAOMT1 as key genes involved in PL synthesis. However, this result still requires further experimental validation.

Discussion

P. sarmentosum is a widely recognized traditional medicinal and edible plant, attracting increasing attention due to its therapeutic and ornamental value2. The genus Piper is of significant commercial and economic importance, widely used in the spice, pharmaceutical, and pesticide industries, and also having a long history of use in traditional practices44. Previous phytochemical research on Piper species has led to the isolation of amide alkaloids, lignans, neolignans, and phenylpropanoids as the main components3,4,5,6. These isolated compounds exhibit a broad range of biological effects, including antifungal, anticancer, anti-inflammatory, and antioxidant activities, sparking widespread interest2. Although many chemical constituents have been isolated and identified from this species, the biological activities of only a few purified compounds have been studied. Therefore, it is crucial to explore the genomic evolution and biosynthesis mechanisms of active compounds in P. sarmentosum. In this study, we assembled a high-quality P. sarmentosum genome using PacBio HiFi and Hi-C sequencing, achieving a Scaffold N50 of 22.52 Mb, with BUSCO and QV quality assessments confirming the genome’s high completeness compared to P. nigrum and P. longum30,31. Notably, the completeness of the gene annotation in P. sarmentosum, as predicted by BUSCO, was 98.6%, significantly higher than P. nigrum’s 83.6%30. This provides an important reference for the genomic analysis of other Piper species, making it an ideal model plant for functional genomics research in the Piper genus.

In this study, significant differences in genome size were observed between P. sarmentosum (527.85 Mb) and P. nigrum (761.2 Mb). Transposable elements (TEs) are an important factor contributing to this size difference45. We found that the repetitive sequences in P. sarmentosum amounted to 305 Mb, constituting 57.88% of the total sequence, while P. nigrum had 417.52 Mb of repetitive sequences, accounting for 54.85%. Previous studies have shown that Ty3/Gypsy elements have been identified as the main repetitive components driving genome expansion in all large-genome species46. In our study, LTR retrotransposons comprised 31.82% of the P. sarmentosum genome, significantly lower than the 40.55% observed in the P. nigrum genome, which may be a crucial factor contributing to the genome size difference30. Our analysis of insertion times revealed that P. sarmentosum and P. nigrum underwent different LTR-type bursts over the past 3.5 MYA. Notably, LTR-RT bursts peaked at 0.25 and 0.5 MYA. During this period, the Earth experienced climate oscillations known as the Pleistocene, and the insertion of LTRs into the genome may have influenced the expression of resistance genes, thereby enhancing the plant’s adaptability to the environment47,48. However, more thorough and in-depth research is needed to understand the underlying processes that led to LTR bursts. WGD play a crucial role in plant genome evolution, often leading to sudden increases in gene content and genome size. Qin et al. used specialized computations to find that P. nigrum underwent three rounds of WGD36. In this study, two α and β WGD events near the Cretaceous-Paleocene boundary were observed at Ks values of ~0.15 and ~0.88 in P. sarmentosum. Additionally, we were able to clearly observe the γ triplication event common to core eudicots at Ks ~1.53 without the need for special calculations. Our research confirms that Piper species underwent three rounds of WGD, with P. nigrum showing no clear γ triplication event, which may be related to its relatively low BUSCO score. On the other hand, the three WGDs in P. sarmentosum are consistent with its higher chromosome number (2n = 52), based on the basic chromosome number (x = 13), suggesting that this plant is polyploid. WGD events have facilitated the expansion of genes related to secondary metabolism, disease resistance, and growth/development, similar to the expansion in P. nigrum, with high expression in roots and fruits. This suggests that pathogenic infection and herbivory pressure are the major selective pressures faced by P. sarmentosum in tropical rainforests30.

Natural products have always been an important source of potential lead compounds, and natural products and their structural analogs have made significant contributions to the treatment of diseases throughout history49. P. sarmentosum contains diverse bioactive compounds, including alkaloids, amides, and phenolic compounds, which have well-documented antioxidant and anti-inflammatory activities. Its crude extracts exhibit inhibitory effects against various microbial pathogens and cancer cell lines, highlighting its potential as a source of novel anticancer agents11,12. In this study, metabolomic analysis identified 4456 metabolites, among which 238 alkaloids demonstrated significant pharmacological activity. Notably, amide-type compounds such as PL, piperine, and PL exhibit multi-target bioactivities, providing molecular evidence for the traditional “medicinal-food homology” properties of P. sarmentosum and supporting its potential as a natural drug repository. Since its first isolation from P. nigrum in 1820, a series of piperamides including piperamide, pipernonaline, and guineensine, have been identified38. Piperine, one of the most abundant compounds in Piper species38, serves as a key quality marker in P. nigrum, with reported concentrations ranging from 29.57 to 54.23 mg/g50. Previous studies have reported that the piperine content in P. sarmentosum (7.9 ± 0.1 mg/g) is significantly lower than that in P. nigrum37. However, the exact reasons for this disparity remain unclear. Studies have identified two key enzymes involved in the biosynthesis of piperine in P. nigrum: piperine synthase, which synthesizes piperine by combining piperoyl-CoA and piperidine, and piperamide synthase, which has broad substrate specificity and can bind various CoA esters with both aliphatic and aromatic amines, catalyzing different reactions39. Both of these enzymes are BAHD-type acyltransferases, encoded by genes that are preferentially expressed in the developing fruits of P. nigrum39.Through comparative genomics, we discovered differences in the BADH gene copy number between P. sarmentosum (126 genes) and P. nigrum (155 genes). Notably, only piperamide synthase was identified in P. sarmentosum, which may explain the lower piperine content in P. sarmentosum. The piperine synthase in P. nigrum originated from a dispersed event, and the absence in P. sarmentosum still requires further investigation with additional Piper species to fully understand this phenomenon. Our transcriptomic and metabolomic analyses revealed that piperamides, including piperamide, pipernonaline, and guineensine, are predominantly enriched in roots, consistent with prior findings38. In contrast, leaves, stems, flowers, and immature fruits exhibit lower alkaloid content, suggesting tissue-specific metabolic partitioning. The high concentration of piperamides in roots may function as a defense mechanism against soil pathogens, potentially acting synergistically with terpenoids to form an effective protective system.

Beyond piperine, amide alkaloids represent one of the most abundant phytochemical classes in P. sarmentosum. PL, a bioactive compound first isolated from P. longum, exhibits remarkable anticancer properties. Its α, β-unsaturated carbonyl moiety is critical for pharmacological activity13, enabling selective induction of tumor cell apoptosis and suppression of metastasis. This compound demonstrates efficacy against multiple cancer cell lines, including breast, colon, prostate, and lung cancers17,18. Despite its therapeutic potential, PL’s low natural abundance poses extraction challenges. While P. longum fruits contain only 1.2 mg/g, P. sarmentosum fruits accumulate higher levels (1.7 mg/g)28. This may be related to resistance against biological stress. Although some studies report elevated concentrations in P. longum roots, agricultural practices primarily harvest flowers and fruits51. Our metabolomic data confirms that P. sarmentosum predominantly synthesizes PL in reproductive tissues, aligning with cultivation needs. Given its chemical structure, we hypothesize that Sinapyl-Coenzyme A, a key metabolite in lignin biosynthesis, serves as the precursor for PL synthesis52. This compound is primarily synthesized through the phenylpropanoid pathway. Through integrated transcriptomic-metabolomic mining, we identified two key precursor enzymes, PsHCT1 and PsCCoAOMT1, involved in the PL biosynthetic pathway. Molecular docking predictions show a high probability of binding between these enzymes and their respective substrates and products, suggesting their potential role in catalyzing key biochemical reactions in PL synthesis. Transcription factors such as bHLH, MYB, and WRKY are known to regulate secondary metabolism, including the synthesis of alkaloids and other bioactive compounds, in response to environmental stress53,54,55,56. Our weighted gene co-expression network analysis (WGCNA) suggests a strong correlation between these transcription factors and the expression of PsHCT1, indicating that they may enhance PL production by modulating the activity of these enzymes. While these computational results are promising, further experimental evidence is required to validate the involvement of PsHCT1, PsCCoAOMT1, and the transcription factors in PL biosynthesis. Ferulic acid, the best precursor for Sinapyl-Coenzyme A biosynthesis, is catalyzed by F5H to form 5-hydroxyferulic acid, which is subsequently converted into sinapate by COMT57. Sinapate undergoes hydroxylation and methylation by key enzymes, such as 4CL and COMT, to form CoA derivatives58. These key enzymes are well-characterized, facilitating the construction of chassis strains for de novo PL synthesis. Interestingly, 5,6-dihydropyridin-2(1H)-one, another precursor for PL synthesis, is typically synthesized through chemical methods, which are costly and complex59,60. Although the biosynthetic pathway has not yet been fully validated, we have identified PsHCT1 and PsCCoAOMT1 as key enzymes potentially involved in PL synthesis and hypothesize that transcriptional regulation through stress-responsive factors could effectively increase PL production. Overall, our study provides valuable directions for further validating these findings and optimizing the biosynthetic pathway for efficient PL production.

Conclusions

Through integrated multi-omics analysis, this study has achieve the chromosome-level genome assembly of P. sarmentosum, significantly advancing genomic research in the Piper genus. We have systematically elucidated the biosynthetic mechanisms of the key bioactive compound PL, providing crucial theoretical foundations for the genetic improvement of medicinal plants and the development of plant-derived pharmaceuticals.

Materials and methods

Plant Material, DNA Extraction, and Sequencing

The P. sarmentosum plant used in this study was originally collected from a spice plantation in Lufeng City, Guangdong Province, China (35°2’ N, 104°3’ E) and subsequently relocated to the Guangdong Provincial Crop Germplasm Resource Nursery (23°16’ N, 113°27’ E) for long-term maintenance under natural conditions. The sample ID is GD2021441581. All genomic sequencing samples were derived from fresh leaves of a single plant. DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) method, as previously described61. After thoroughly grinding the P. sarmentosum leaf tissues, 5 mL of 2% CTAB, 50 μL of β-mercaptoethanol, and 26 μL of proteinase K were added, and the mixture was well mixed. The samples were incubated at 65 °C for 30 min. After incubation, the samples were centrifuged at 12,000 rpm for 10 min, and the supernatant was collected. Next, an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1) was added to the supernatant and mixed thoroughly. The mixture was then centrifuged at 12,000 rpm for 10 minutes. This step was repeated twice, with the supernatant collected each time. Then, an equal volume of chloroform/isoamyl alcohol (24:1) was added, the mixture was mixed again, and centrifuged at 120,000 rpm for 10 min. The supernatant was collected. Subsequently, pre-chilled isopropanol (4 °C) was added, and the mixture was inverted to mix and incubated at 4 °C for 30 min to precipitate the DNA. The samples were centrifuged at 12,000 rpm for 10 min to collect the DNA precipitate. The precipitate was washed twice with 1 mL of pre-chilled 70% ethanol (4 °C), centrifuging at 12,000 rpm for 5 min each time, and then air-dried at 37 °C. Finally, the dried DNA pellet was dissolved in Tris-HCl buffer, treated with RNase A, and incubated at 37 °C for 30 min to remove RNA. The final DNA solution was stored at -20 °C. The concentration and quality of the extracted DNA were evaluated using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). For long-read sequencing, a PacBio SMRTbell library was constructed using the SMRTbell Prep Kit 3.0 (Pacific Biosciences, Menlo Park, CA, USA) according to the manufacturer’s protocol. Sequencing was performed on the PacBio Revio platform (PacBio Sequel II system) to generate high-fidelity (HiFi) data. To improve the continuity of the genome assemblies, we created Hi-C libraries for the young leaves of the same plant using the GenSeq Hi-C library preparation kit (Cloud-Seq Bio Co., Ltd., China, Shanghai). The kit involves crosslinking, DpnII digestion, biotin labeling, ligation, shearing, and biotin capture62. Sequencing of the Hi-C fragments was conducted on the DNBSEQ-T7 platform (MGI, Shenzhen, China).

De novo Genome Assembly

First, HiFiAdapterFilt v3.0.063 was used to remove reads with residual PacBio adapter sequences from the HiFi sequencing data and perform quality control. Jellyfish v2.3.164 was used to count k-mers in the HiFi sequencing data with the parameter kmer=21. Genomescope.R v2.065 was employed to estimate the genome size, setting the maximum k-mer coverage threshold to 10,000. The initial genome assembly of PacBio Revio sequencing platform HiFi reads was performed using Hifiasm v0.15.566 with default parameters. Subsequently, the genome was polished three times using Pilon v1.2467. Redundant overlapping contigs were removed using Purge_Haplotigs v1.1.268. The final round of gap filling for the draft genome was conducted using NextDenovo v2.17-r94169 and TGS_Gapcloser2 v1.9.470. Next, the Hi-C reads were aligned to the assembled genome using Juicer v1.6071 software, allowing us to obtain interaction data between chromatin regions across contigs. The 3D-DNA v18092272 pipeline was used to automatically generate candidate chromosome-level assemblies to correct erroneous connections, orders, and orientations, and to organize the contigs into the draft chromosome assembly. Errors in the chromosome-level assembly were manually corrected in Juicebox v2.20.0073, and Hi-C reads were aligned to the chromosome-level assembly using HiCExplorer v3.7.274 to evaluate the clarity of structural domains across different chromosomes. The final chromosome-level genome assembly was assessed using BUSCO v3.1.075 based on the embryophyta_odb10 gene set. Minimap2 v2.2876 was used to map PacBio HiFi reads to the assembled genome to evaluate coverage and average depth. Genome quality was assessed using LTR_retriever v2.9.68177. Finally, Mequery v1.378 was used to evaluate the assembly’s overall quality (QV) and completeness.

Genome Annotation

The identification of known TEs in the genome is performed using RepeatMasker v4.1.079 with the Repbase TE library, and the TE protein database is queried using RepeatProteinMask. Subsequently, RepeatModeler2 v1.0.1180 is employed to construct de novo repeat libraries for each genome, followed by detailed analysis, refinement, and classification. LTRs are identified with LTR_Retriever v2.9.68177 and further classified using TESorter v1.4.68281. The terminal regions of LTRs are aligned with Mafft v7.5058482, and insertion times for LTRs are calculated using the formula T = K/2r, where K represents the genetic distance, and r is the substitution rate of 1.3e-8 substitutions per site per year. An unrooted phylogenetic tree is constructed using the neighbor-joining method with FastTree v2.1.118583. For accurate protein-coding gene prediction, a combination of transcriptomic, homology-based, and de novo approaches is used. De novo predictions are made with Augustus84, GenScan85, and GlimmerHMM86, utilizing the Arabidopsis training set. Homology-based predictions employ GeMoMa87, incorporating protein sequences from P. nigrum and other species. Transcriptomic predictions align the de novo transcriptome assembly with the genome and refine gene structures using PASA. Gene model consensus sets are generated using EVidenceModeler88, excluding single-exon genes supported solely by transcriptomic data and genes with fewer than three exons. To further enhance the reliability of annotations, TransposonPSI v2.2.2689 is used to identify transposon-related genes, while Pseudogene Pipeline identifies pseudogenes. Pseudogenes and transposon-related genes with FPKM < 1 are excluded. Gene function annotations are performed using EggNOG-mapper90, Diamond, and InterProScan, providing annotations for GO, EC, KEGG pathways, COGs, and other functional categories. Non-coding RNA annotations are performed using Infernal v1.1.4 and tRNAscan-SE v2.0.9. The integrity of the genome or protein sequences is assessed using BUSCO75 with the embryophyta_odb10 plant reference set.

Phylogenetic Analysis

The longest protein sequences from each gene model of 15 plant species were extracted and subjected to an all-versus-all Diamond alignment. Homologous genomes were identified using the default parameters of OrthoFinder v3.0.1b191 software. The aligned tandem amino acid sequences were further refined using MAFFT v7.27182, followed by trimming with trimAI v1.4.rev22. A maximum likelihood phylogenetic tree was constructed using RAxML v8.2.1292 with the PROTGAMMAJTT model and 1000 bootstrap replicates, using A. trichopoda as the outgroup. The species tree topology was calibrated using MCMCtree v4.993, incorporating two temporal calibration points: Node 1 (the common ancestor of A. trichopoda and O. sativa) estimated to be between 179 and 205 MYA; Node 2 (H. annuus and C. arabica) estimated to be between 97.5 and 109.2 MYA, with data sourced from the TimeTree94 database. A log-normal independent molecular clock model was used to generate 1,000,000 samples, discarding the first 25% as burn-in. The MCMCtree R package was employed to visualize the reliable divergence time intervals of each node in the species tree. Gene family expansion and contraction were assessed using CAFE v4.2.195, and conditional P-values for each gene family were calculated. Gene families with P-values < 0.05 were considered to have significantly accelerated rates of expansion or contraction. Gene families with more than 100 copies in a single species were removed. Homologous protein sequences between species were aligned using Diamond v0.9.29.13096 (E-value < 1e-5, C-score > 0.5), and syntenic blocks were identified using MCScanX v1.0.097. To investigate the evolution of the P. sarmentosum genome, the synonymous substitution rate (Ks) for collinear gene pairs was calculated using wgdi v1.1.198.

Transcriptome and Metabolomic Analysis

The transcriptome sequencing was performed on five different tissues, including roots, stems, mature leaves, flowers, and fruits. Each sample consisted of three biological replicates. Total RNA was extracted using the Trizol reagent kit (Invitrogen, Carlsbad, CA, USA), and its quality was assessed using the Agilent 2100 Bioanalyzer. Eukaryotic mRNA was enriched with Oligo(dT) beads, while prokaryotic mRNA was enriched by removing rRNA using the Ribo-ZeroTM Magnetic Kit (Epicentre, Madison, WI, USA). The cDNA libraries were constructed using the NEBNext UltraTM RNA Library Prep Kit for Illumina (New England BioLabs, Beijing, China). The resulting cDNA library was sequenced on the MGI T7 platform (MGI, Shenzhen, China) using a 2 × 150 bp paired-end run, with the experimental procedures following standard plant RNA extraction protocols. Sequencing was performed by Yuda Biotechnology Co., Ltd (Guangzhou, China). After filtering low-quality reads with fastp v0.23.499, clean reads were mapped to the P. sarmentosum genome using HISAT2 v.2.2.1100. Reads were quantified using featureCounts. Reads were quantified using featureCounts, and gene expression values were represented as TPM. We then used WGCNA v1.73101 to filter genes with low expression variation (standard deviation ≤ 0.8), setting the power value to 30 and using other default parameters for analysis. Tissue-specific gene clusters were identified based on module-metabolite correlations (Pearson |cor | > 0.95). The clusterProfiler v4.0102 package was used to perform GO and KEGG enrichment analysis on the genes in the identified tissue-specific modules.

In this study, a TripleTOF 6600+ mass spectrometer (SCIEX, Foster City, CA, USA) was used for metabolite analysis, providing high sensitivity and high-resolution qualitative and quantitative analysis. The combination of liquid chromatography (LC) and mass spectrometry ensured the complete process from substance separation to substance identification. Sample collection and transcriptomics analysis were consistent, with samples taken from root, stem, mature leaf, flower, and fruit tissues, each using three biological replicates. Samples were initially freeze-dried using a freeze dryer (Scientz-100F) and then ground into powder using a grinder (MM 400, Retsch) at a frequency of 30 Hz for 1.5 min to achieve a uniform powder. For extraction, 50 mg of powdered sample was weighed and extracted with 1200 µL of pre-chilled 70% methanol aqueous extraction solution at -20 °C, with shaking for 30 s every 30 min for a total of six cycles. The mixture was then centrifuged at 12,000 rpm for 3 min, and the supernatant was collected, filtered, and stored in vials for UPLC-MS/MS analysis. HPLC analysis was performed using a Shimadzu LC-30A ultra-high-performance liquid chromatography system (Shimadzu, Japan) equipped with a Waters ACQUITY Premier HSs T3 column (1.8 µm, 2.1 × 100 mm) in positive ion mode, with a mobile phase consisting of 0.1% formic acid in water (A) and 0.1% formic acid in acetonitrile (B). The gradient elution program was as follows: 5-20% B (2 min), 20–60% B (3 min), 60–99% B (1 min), held for 1.5 min, and then returned to 5% B. The column temperature was set at 40 °C, with a flow rate of 0.4 mL/min and an injection volume of 4 µL. Negative ion mode analysis was conducted using the same gradient. Mass spectrometry data were collected in Information Dependent Acquisition (IDA) mode, with source gas pressure set at 50 psi, temperature at 550 °C, spray voltage at ±5000 V, collision energy at ±30 V, scanning range of 50-1000 Da, accumulation time of 200 ms, product ion scanning range of 25-1000 Da, and collision energy at ±30 V, monitoring a maximum of 18 ions. Peak areas were corrected using the SVR method. The data used in this study were derived from mass spectrometry (MS) analysis, which generated mass spectral features that were annotated and compared with known metabolite databases. The corrected peaks were used for metabolite identification by searching the laboratory’s self-built database, integrating public databases, prediction libraries, and the metDNA method. By setting criteria of a composite score greater than 0.5 and a sample coefficient of variation (CV) less than 0.5, metabolic features that met these conditions were selected from the initial UPLC-MS/MS features. Finally, metabolites with a composite score above 0.5 and an OC sample CV value less than 0.5 were extracted, and the positive and negative modes were then merged to obtain the relative content of all metabolites. Subsequent metabolomic analysis was conducted exclusively based on these selected metabolic features to ensure the reliability and consistency of the results.

Alkaloid Biosynthesis Gene Identification and Molecular Docking

To identify candidate genes involved in the biosynthesis pathway of alkaloids, BLAST searches were performed on the Piper genome assembly with an E-value threshold of ≤ 1e-5, using protein sequences from A. thaliana and Piper species as queries. The sequences encoding shikimate kinase (PsSK), 3-phosphoshikimate 1-carboxyvinyl transferase (PsEPSPase), chorismate synthase (PsCS), chorismate mutase (PsCM), aromatic-amino-acid transaminase (PsPAT), prephenate dehydratase (PsPDT), phenylalanine ammonia-lyase (PsPAL), trans-cinnamate 4-monooxygenase (PsC4H), caffeic acid 3-O-methyltransferase (PsCOMT), 4-coumarate:Coenzyme A ligase (Ps4CL), lysine decarboxylase (PsLDC), feruloyl-CoA diketide synthase (PsFDS1), CYP98A (PsC3H), ferulate-5-hydroxylase (PsF5H), copper amine oxidase (PsCAO) and acyltransferases (PsBADH2) were retrieved from NCBI. These gene sequences were used as baits to identify homologous genes in the P. sarmentosum genome.

Subsequently, candidate genes were analyzed using the Pfam and SMART databases to verify conserved domains and motifs. Proteins with identical conserved domains were considered homologous. To identify BAHD and OMT proteins in P. sarmentosum, we downloaded the Pfam seed files for OMT genes (PF00891, PF01596) and BAHD genes (PF02458, PF07247) from the Pfam database (http://pfam.xfam.org/). HMMER v3.0103 and local BLAST were employed to search for sequences containing BADH and OMT protein domains. The identified protein sequences were aligned to remove redundant sequences. A hidden Markov model (HMM) wssssas then reconstructed for the detected proteins, and sequences with corresponding Pfam domains were searched again. Further validation of BAHD and OMT family members was performed using the online tools SMART and CDD.

AlphaFold 3 performs homology modeling of the mutated sequences of PsHCT1 and PsCAOMT1. The best model is selected based on molpdf, DOPE, and GA341 potential scores. The geometric structure and quality of the model are checked online (http://servicesn.mbi.ucla.edu/SAVES/). Docking analysis is carried out using AutoDock Vina, and the results are analyzed using PyMOL software.

Subcellular localization prediction and validation for PsHCT1

To predict the subcellular localization of PsHCT1, we employed ProtComp version 9 (http://linux1.softberry.com/berry.phtml), which compares the target protein with homologous proteins of known localization and utilizes data from the LocDB and PotLocDB databases. The coding region of PsHCT1 was amplified from cDNA, which was synthesized through reverse transcription of RNA extracted from leaf samples collected during an RNA-seq experiment. This amplified sequence was then cloned into the pCAMBIA2300 vector downstream of the 35S promoter, resulting in the construction of a PsHCT1-GFP fusion gene (Forward Primer: ATGGGAAATGGTGATTTTAGTGTGTC; Reverse Primer: TCAAGAGTTGAATCCAAGATAGTC). Subsequently, the Agrobacterium suspension containing these constructs was combined with a plasma membrane marker (PsHCT1-RFP) at an optical density (OD600) of 0.6, and this mixture was infiltrated into tobacco leaves. After 72 hours, the GFP and RFP signals were visualized using a Zeiss LSM980 confocal fluorescence microscope.

Statistics and reproducibility

In this study, HiFi and HiC sequencing used for genome assembly were performed as biological replicates once. RNA-seq and metabolomic profiling were each conducted with three biological replicates. For transcript-metabolite network analysis, a correlation coefficient > 0.9 or < -0.9, and a p-value < 0.05, were used as screening criteria. Subcellular localization was also performed with three biological replicates. GO and KEGG enrichment analyses (P < 0.05) were carried out using the clusterProfiler package, with multiple testing corrections applied using FDR, and the significance level was set at P < 0.05. All data are expressed as mean ± standard deviation. During the manuscript preparation, we employed a large language model (ChatGPT-5) for AI-assisted text editing, primarily for grammar checking, language translation, and optimizing text expression. The AI was used solely to enhance the readability and linguistic quality of the text, and it did not participate in research design, data analysis, or content creation. The final manuscript was reviewed and approved by the authors to ensure it reflects their original work.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.