Abstract
Euphorbia, one of the largest genera of flowering plants, is well-known for containing many biofuel crops. Euphorbia tirucalli, an evergreen succulent mainly native to the Africa continent but cultivated worldwide, is a promising petroleum plant with high tolerance to drought and salt stress. However, the exploration of such an important plant resource is severely hampered by the lack of a reference genome. Here, we present the chromosome-level genome assembly of E. tirucalli using PacBio HiFi sequencing and Hi-C technology. Its genome size was approximately 745.62 Mb, with a contig N50 of 74.16 Mb. A total of 743.63 Mb (99.73%) of the assembled sequences were anchored to 10 chromosomes with a complete BUSCO score of 97.80%. Genome annotation revealed 26,304 protein-coding genes, and 76.37% of the genome was identified as repeat elements. The high-quality genome provides valuable genetic resources that would be useful for unraveling the genetic mechanisms of biofuel synthesis and evolutionary adaptation of E. tirucalli.
Similar content being viewed by others
Background & Summary
Euphorbia, belonging to the family Euphorbiaceae, comprises about 2000 species and is one of the largest flowering plant genera in the world1. Many fuel plants have been reported in this genus, providing biomass for the production of biocrude, bioethanol and other bioenergy resources2,3,4,5. Euphorbia tirucalli (2n = 2x = 20)6, commonly referred to as milk bush, pencil cactus, pencil tree, or naked lady, is an evergreen shrub or small tree with typically succulent branched stems and small non-succulent leaves (Fig. 1a). It is naturally distributed in Indochina, South Africa, East Africa and Madagascar, and has been extensively cultivated as horticultural plant in other tropical or subtropical areas7. As one of the representative oil plants, E. tirucalli has long been considered a promising substitute for traditional energy sources. It exudes a milky latex from the wounded shoots or leaf stems3,8. Recent studies demonstrated that the compounds in the latex exhibit high petrochemical properties5,9,10. In addition, this species has important medicinal value, with its latex being traditionally used for the treatment of cancer, asthma, arthritis, rheumatism and so on11,12,13,14.
Overview of Euphorbia tirucalli genome assembly and features. (a) The picture of the sequenced E. tirucalli from South China National Botanical Garden (accession number: IBSC0312991). The inset map shows its succulent branched stems and small non-succulent leaves. (b) K-mer (17-mer) frequency distribution curve. (c) Distribution of genomic features of E. tirucalli. Tracks ‘a–f’ represent tandem repeat density, LTR Gypsy density, LTR Copia density, TE density, GC content, and gene density, respectively. (d) Hi-C interaction heat map for E. tirucalli.
Euphorbia tirucalli has high salinity and drought tolerance, which enables it to grow under a wide range of adverse conditions without occupying any arable land. Different genotypes exhibit distinct evolutionary adaptation to environmental stress15. In particular, the special photosynthetic system in E. tirucalli, i.e., the combination of C3 metabolism in non-succulent leaves and the Crassulacean Acid Metabolism (CAM) pathway in succulent stems, could efficiently maximize biomass accumulation7,16,17. Specifically, C3 promotes growth under favorable conditions, while non-succulent C3 leaves die quickly and CAM plays a critical role under drought stress, which could prevent damage from water limitation and ensure photosynthetic integrity16.
Due to the global energy crisis with the conventional fossil fuels and the associated environmental degradation18,19,20,21, E. tirucalli has received increasing attention in recent years, thanks to its fascinating petrochemical values and high tolerance to extreme habitats4,5,22,23. However, the utilization of such an important plant resource is severely hampered by the unavailability of genomic data. Therefore, a high-quality assembled genome of E. tirucalli is urgently required to uncover the genetic basis of both biodiesel production and stress resistance.
In this study, we performed a de novo chromosome-level genome assembly and annotation of E. tirucalli using PacBio HiFi sequencing and high-throughput chromosome conformation capture (Hi-C) technology. The assembled genome size of E. tirucalli was 745.62 Mb, with a contig N50 of 74.16 Mb. A total of 743.63 Mb (99.73%) of the assembled sequences were anchored to 10 chromosomes with a complete BUSCO score of 97.80%. The genome annotation identified 26,304 protein-coding genes. The high quality genome provides valuable genetic resources for further research on the genetic mechanisms underlying biofuel synthesis and adaptation to harsh conditions in E. tirucalli.
Methods
Sampling and sequencing
Fresh leaves of E. tirucalli were collected for whole genome sequencing from a healthy tree planted in the South China National Botanical Garden (accession number: IBSC0312991) (Fig. 1a). Additionally, tender leaves, mature leaves, young succulent stems, old stems, flowers, and roots were collected for transcriptome sequencing. All samples were immediately frozen in liquid nitrogen and stored at −80 °C.
Total genomic DNA was extracted using a cetyltrimethylammonium bromide (CTAB) method24. DNA quantity and quality were determined using a Qubit 4.0 fluorometer. Short read sequencing libraries with ~350 bp insert size were constructed and sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp read pairs. For PacBio SMRT sequencing, a 15 kb DNA PacBio HiFi library was generated using the SMRTbell Express Template Preparation Kit 2.0. The library was then sequenced on the PacBio Sequel II platform, yielding 41.54 Gb of HiFi data with 53.96 × coverage. The Hi-C libraries were prepared by chromatin crosslinking, restricted enzyme (MboI) digestion, end filling and biotinylation tagging, DNA purification and shearing. All of the prepared DNA fragments were processed into paired-end sequencing libraries. Finally, a total of ~100 Gb of 150 bp paired-end Hi-C reads were obtained from the Illumina platform. For transcriptome sequencing, the RNA-seq libraries were constructed and then sequenced on the Illumina NovaSeq 6000 platform with 150 bp paired-end reads, generating about 18.8 Gb of RNA-seq reads.
Genome survey for genome size estimation
A total of 35 Gb of clean data from the Illumina platform was used for the genome survey. Genome size, heterozygosity and repeat content were estimated from the k-mer frequency distribution using GCE25. The 17-mer analyses yielded an estimated genome size of 789.98 Mb, with heterozygosity of 0.80% and repeat content of 74.67% (Fig. 1b).
De novo genome assembly
The PacBio HiFi reads were initially assembled into contigs using hifiasm v0.16.1-r37526 with default parameters. This analysis resulted in an assembly of 745.62 Mb and a N50 length of 74.16 Mb (Table 1). The size of the assembled genome is slightly smaller than our estimates (~760 Mb) by flow cytometry using Oryza sativa ssp. japonica (1 C = 0.43–0.45 pg) as a reference standard (Fig. S1). The completeness of the genome assembly was assessed using Benchmarking Universal Single-Copy Orthologues (BUSCO v5.4.3)27 with the embryophyta_odb10, which generated 97.8% of the Plantae BUSCO genes (Table 1). The accuracy of the draft assembly was further evaluated by mapping short reads to the genome assembly using the BWA-MEM v0.7.17-r118828, which yielded a mapping rate and genome coverage of 99.88% and 99.75%, respectively (Table 1). For pseudochromosome construction, Hi-C reads were aligned to the contig-level assembly using Juicer v1.629 with default parameters. We then used the 3D-DNA v201008 pipeline30 to correct mis-joins, anchor, order, and orient the assembled contigs. Manual inspections and adjustments of the draft assembly were performed using Juicebox v1.11.0831. Finally, approximately 743.63 Mb of scaffold was anchored to 10 pseudochromosomes, accounting for 99.73% of the assembled genome size (Fig. 1c,d). To evaluate the continuity of the genome assembly, the long terminal repeat retrotransposons assembly index (LAI), a reference-free genome metric for assessing the assembly of repeat sequences32 was calculated. The LAI value of the genome assembly was 19.05, which was close to the quality standard of a gold genome (LAI > 20) proposed by Ou et al.32. Collectively, these results validated the high completeness and reliability of our E. tirucalli genome assembly.
Repeat annotation
Tandem repeats were identified by ab initio prediction using TRF v4.0933. We used both de novo and homology predictions to annotate transposable elements (TEs) throughout the genome. For the homology-based strategy, RepeatMasker34 and RepeatProteinMask34 were used to extract the known repeat sequences. For de novo prediction, long terminal repeat (LTR) elements were first identified with LTR_FINDER v1.0.635, LTRharvest v1.5.1036 and LTR_retriever37. Then, MITE-hunter38 and RepeatModeler39 were used for de novo repeat discovery. The MITE and consensus repeat libraries generated by RepeatModeler were combined and subjected to RepeatMasker for final repeat identification. Overall, 569.43 Mb (76.37%) of the assembly was masked as repeats. Of these, 75.70% were TEs, including long terminal repeat retrotransposons (LTR-RTs) (61.11%), non-LTR-RTs (1.10%), and DNA transposons (6.56%) (Fig. 1c; Table 2).
Gene prediction and functional annotation
Gene prediction was performed using a combined strategy of homology-based, ab initio, and RNA-Seq-assisted predictions. In detail, Trinity v2.15.040 was used to de novo assemble the transcriptome for RNA-Seq-assisted prediction. Hisat2 v2.2.141 was utilized to map RNA-seq reads to the genome, and Samtools v1.942 was used to generate BAM file. Then Trinity and StringTie43 were utilized to assemble the genome-guided transcriptomes. The de novo and genome-guided transcriptome assemblies were merged, generating transcript evidence using PASA44. SNAP45, GeneID46, GlimmerHMM47, GeneMark-ET48 and AUGUSTUS49 were used for ab initio prediction with RNA-Seq-based predicted genes as training data. Homologies from five Euphorbiaceae species (E. lathyris, E. peplus, Ricinus communis, Hevea brasiliensis, and Manihot esculenta) and Arabidopsis thaliana were used as protein evidence for predicted gene sets using GeMoMa50. Finally, the results from the above three approaches were integrated using EVidenceModeler51 and further polished using PASA44. A total of 28,840 protein-coding genes were successfully predicted for E. tirucalli, with the average gene, intron and exons lengths of 4191.84 bp, 684.30 bp and 290.86 bp, respectively (Table 1).
For the functional annotation of protein-coding genes, a BLASTP (E-value ≤ 1e−5) search with the best match parameters was performed against publicly available protein databases of SwissProt and NR. Motifs and domains were annotated using InterProScan52 by searching against InterPro and Pfam. Gene ontology (GO) terms of the annotated peptide sequences were obtained using eggNOG-mapper v2.1.553. Finally, 26,304 protein-coding genes were functionally annotated, representing 91.21% of the total predicted genes (Table 1).
Data Records
The raw sequence data (Illumina, PacBio, Hi-C) have been uploaded to NCBI Sequence Read Archive (SRA) database with accession number SRR2788584254, SRR2788583455, and SRR2788583556, respectively, under the BioProject accession number PRJNA1070402. The RNA-seq data for different tissues are also available under PRJNA1070402 with accession numbers SRR2788583657, SRR2788583758, SRR2788583859, SRR2788583960, SRR2788584061, SRR2788584162. The final chromosome assembly has been deposited in NCBI GenBank with accession number JAZDXJ00000000063. The genome annotation file has been deposited in the Figshare database64.
Technical Validation
We assessed the quality of the genome assembly in the following aspects: (1) Genome completeness was assessed by BUSCO v5.4.34. The results indicated that 97.8% complete BUSCO genes were identified in the final assembly, of which 95.6% were single-copy, 2.2% were duplicated, and 0.4% were fragmented. (2) Mapping short reads to the genome assembly, which revealed a mapping rate and genome coverage of 99.88% and 99.75%, respectively. (3) The LTR Assembly Index (LAI) of the genome assembly is 19.05, which is close to the quality of a gold genome according to the classification system32. (4) Quality value (QV) and k-mer completeness were estimated using Merqury v1.365, resulting in a QV of 67.58 and completeness of 87.71%. These results indicate that the E. tirucalli genome assembly is of high quality.
Code availability
All software and pipelines used in this study were implemented according to the manuals and protocols provided by the software developers. Versions of the software have been described in Methods. No custom code was used in this study.
References
Bruyns, P. V., Mapaya, R. J. & Hedderson, T. J. J. T. A new subgeneric classification for Euphorbia (Euphorbiaceae) in southern Africa based on ITS and psbA-trnH sequence data. Taxon 55, 397–420 (2006).
Duke, J. Euphorbia tirucalli L., handbook of energy crops. Purdue University centre for new crops and plant products (1983).
Loke, J., Mesa, L. A. & Franken, J. Y. Euphorbia tirucalli biology manual: Feedstock production, bioenergy conversion, application, economics Version 2. FACT. (2011).
Hastilestari, B. R. et al. Euphorbia tirucalli L.–Comprehensive characterization of a drought tolerant plant with a potential as biofuel source. PLoS one 8, e63501 (2013).
Duy Khang, N. V., Nhi Nguyen, T. & Quan, T. L. Potential biofuel exploitation from two common Vietnamese Euphorbia plants (Euphorbiaceae). Biofuel. Bioprod. Biorefin. 17, 1315–1327 (2023).
Li, Z., Huang, J. & Li, L. Polyploid induction in Euphorbia tirucalli L. with Colchicine. J Southwest 29, 106–110 (2007). (in Chinese).
Van Damme, P. L. Euphorbia tirucalli for high biomass production. In Combating desertification with plants. (Boston, MA: Springer US, 2001).
Saigo, R. H. & Saigo, B. W. Botany, principles and applications. (Pren-tice-Hall, Englewood CliVs, 1983).
Calvin, M. Petroleum plantations for fuel and materials. Bioscience. 29, 533–538 (1979).
Calvin, M. Fuel oils from euphorbs and other plants. Bot. J. Linnean Soc. 94, 97–110 (1987).
Van Damme, P. Het traditioneel gebruik van Euphorbia tirucalli. Afrika Focus 5, 176–193 (1989).
Aylward, J. H. & Parsons, P. G. Treatment of prostate cancer. (Peplin Research May, US, 2008).
Kumar, A. Some potential plants for medicine from India. (Ayurvedic medicines, University of Rajasthan, Rajasthan, 1999).
Schmelzer, G. H. & Gurib-Fakim, A. Medicinal plants. (Plant Resources of Tropical Africa, 2008).
Abuelsoud, W., Hirschmann, F. & Papenbrock, J. Sulfur metabolism and drought stress tolerance in plants. Vol. 1: Physiology and Biochemistry 227–249 (2016).
Cushman, J. C. & Borland, A. Induction of Crassulacean acid metabolism by water limitation. Plant Cell Environ. 25, 295–310 (2002).
Janssens, M. J., Keutgen, N. & Pohlan, J. The role of bio-productivity on bio-energy yields. J. Agr. Rural Dev. Trop. 110, 41–48 (2009).
Zika, M. & Erb, K. H. The global loss of net primary production resulting from human-induced soil degradation in drylands. Ecol. Econ. 69, 310–318 (2009).
Wang, R. et al. Production and selected fuel properties of biodiesel from promising non-edible oils: Euphorbia lathyris L., Sapium sebiferum L. and Jatropha curcas L. Bioresour. Technol. 102, 1194–1199 (2011).
Dai, A. Increasing drought under global warming in observations and models. Nat. Clim. Change 3, 52–58 (2013).
Taparia, T., Mvss, M., Mehrotra, R., Shukla, P. & Mehrotra, S. Developments and challenges in biodiesel production from microalgae: A review. Biotechnol. Appl. Biochem. 63, 715–726 (2016).
Demirbas, A. Competitive liquid biofuels from biomass. Appl. Energy. 88, 17–28 (2011).
Sharma, R., Wungrampha, S., Singh, V., Pareek, A. & Sharma, M. K. Halophytes as bioenergy crops. Front. Plant Sci. 7, 1372 (2016).
Winnepenninckx, B., Backeljau, T. & De Wachter, R. Extraction of high molecular weight DNA from molluscs. Trends Genet. 9, 407 (1993).
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv.org, arXiv: 1308.2012 (2013).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760.
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126–e126 (2018).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 5, 4.10.11–14.10.14 (2004).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. Bmc. Bioinformatics 9, 1–114 (2008).
Ou, S. & Jiang, N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939 (2008).
Guigó, R., Knudsen, S., Drake, N. & Smith, T. Prediction of gene structure. J. Mol. Biol. 226, 141–157 (1992).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Lomsadze, A., Burns, P. D. & Borodovsky, M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 42, e119–e119 (2014).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–439 (2006).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 49, D344–D354 (2021).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. Eggnog-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885842 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885834 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885835 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885836 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885837 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885838 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885839 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885840 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR27885841 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc:JAZDXJ000000000 (2024).
Wang, J. et al. Chromosome-level genome assembly of Euphorbia tirucalli (Euphorbiaceae), a highly stress-tolerant oil plant. Figshare https://doi.org/10.6084/m9.figshare.25224737 (2024).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 1–27 (2020).
Acknowledgements
We thank Associate Editor and anonymous reviewers for their helpful comments. This work was supported by Guangdong Province Basic Research Flagship Project (2023B0303050001), National Natural Science Foundation of China (No. 31970360), Natural Science Foundation of Guangdong Province (2022A1515011293), and Youth Innovation Promotion Association CAS (2021348).
Author information
Authors and Affiliations
Contributions
J.W. and M.K. conceived and designed the study. J.W. and X.S. prepared the samples. Z.W., C.F. and J.X. analyzed the data. J.W. and W.Z. wrote the manuscript. M.K. revised the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wei, Z., Feng, C., Xu, J. et al. Chromosome-level genome assembly of Euphorbia tirucalli (Euphorbiaceae), a highly stress-tolerant oil plant. Sci Data 11, 658 (2024). https://doi.org/10.1038/s41597-024-03503-w
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41597-024-03503-w
This article is cited by
-
Evolution of linear triterpenoid biosynthesis within the Euphorbia genus
Nature Communications (2025)



