Background

Intestinal parasite infections, which represent a significant global public health concern, disproportionately affect marginalized communities with limited access to clean water and sanitation facilities1,2,3,4. According to the World Health Organization (WHO), an estimated 3.5 billion people are at risk of intestinal parasite infection, and of this number, approximately 1.5 billion people currently suffer from some form of intestinal parasitic infection5.

Intestinal parasites constitute a diverse group of organisms, including helminths such as nematodes (roundworms), trematodes (flatworms), and cestodes (tapeworms)6, as well as protozoa such as Giardia lamblia and Entamoeba histolytica7. These insidious pathogens represent a major threat to public health and often lead to severe morbidity, malnutrition, and even mortality6,8. It has also been reported that understanding disease-related pathogens and accurately diagnosing infectious illnesses are paramount for the development of effective control and prevention strategies9. This need has spurred decades of research and innovation in the field of parasitology.

Conventional methods for parasite detection, including microscopic examination10,11, polymerase chain reaction (PCR)12,13,14,15,16, and enzyme-linked immunosorbent assay (ELISA)17,18,19,20,21,22,23,24, still play a crucial role in the diagnosis and monitoring of intestinal parasite infections. However, they do have some limitations.

The accuracy of microscopy with respect to the identification of intestinal parasites depends on the skill level of the operating technician, and microscopic examinations may not lead to the detection of infections when the number of parasites present is low. It can also be time-consuming and labor-intensive, requiring trained personnel and specialized equipment25,26,27,28. Further, serological assays, such as ELISA, show promise for diagnosing parasitic infections, but when used in isolation, they are prone to showing higher rates of false results, especially in cases where there is cross-reactivity among antigens from different parasite species29. Furthermore, one of the drawbacks of PCR is its requirement for meticulously designed primers tailored to specific target parasites30, and designing these primers demands an in-depth understanding of the parasite’s genetic makeup. Thus, the process is often time-consuming and expensive.

Therefore, new strategies are needed for screening multiple samples for various pathogens, and the advancement of molecular diagnostics requires rigorous validation of lab-developed assays for accurate infectious disease diagnosis31. Recent advances in molecular biology and next-generation sequencing (NGS) technologies have opened new avenues for the rapid and accurate screening of multiple parasite species. Specifically, metabarcoding, a methodology that enables the simultaneous screening of multiple parasite species within a single sample, has shown great promise in the field of parasitology32,33,34,35,36.

In this study, 11 intestinal parasite species were screened using 18S ribosomal RNA (rRNA) gene amplicon-based NGS. The purpose of this study was to detect these 11 intestinal parasites simultaneously using metabarcoding and to optimize the NGS library preparation protocol, so as to shed light on the factors that affect NGS results for parasitic infections. The results obtained may be useful for improving diagnostic accuracy and may ultimately aid public health efforts to control and prevent intestinal parasitic infections.

Methods

DNA extraction

Helminth specimens (Ascaris lumbricoides, Clonorchis sinensis, Dibothriocephalus latus, Enterobius vermicularis, Fasciola hepatica, Necator americanus, Paragonimus westermani, Taenia saginata, Trichuris trichiura) preserved in ethanol and protozoa (Giardia intestinalis, Entamoeba histolytica) cultured in the laboratory of the Department of Tropical Medicine, Yonsei University College of Medicine, were used in this study37,38. Parasite DNA was extracted using the FastDNA SPIN Kit for Soil (MP Biomedicals, Carlsbad, CA, USA) according to the manufacturer’s protocol. The DNA samples were stored at −80 °C until needed.

Thymine-adenine (TA) clone targeting of the 18S rRNA gene V9 region

PCR was performed on the individual DNA samples to amplify the V9 region of the 18S rRNA gene of each parasite. The primers used were 1391F (5′-GTACACACCGCCCGTC-3′) and EukBR (5′-TGATCCTTCTGCAGGTTCACCTAC-3′). The amplicons of the 18S V9 regions of the 11 intestinal parasites were cloned using the TOPcloner TA Kit (Enzynomics, Daejeon, Korea) in accordance with the manufacturer’s instructions, and the recombinant colonies obtained were stored at −80 °C until use. The cloned plasmids were extracted using the Exprep Plasmid SV Mini Kit (GeneAll, Seoul, Korea) after overnight culture in Luria–Bertani broth containing ampicillin. Finally, the concentrations of the extracted plasmids were measured using a Quantus™ fluorometer (Promega, Madison, WI, USA). The sample preparation and amplicon sequencing workflow is outlined in Fig. 1.

Fig. 1

Flow chart outlining the sample preparation and amplicon sequencing process.

Restriction enzyme for plasmid linearization

To minimize the steric hindrance of the circular plasmids and primers, the plasmids were linearized using a restriction enzyme, NcoI (Thermo Scientific™, Waltham, MA, USA) at a concentration of 10 U/µL, which has one restriction site in all 11 types of plasmids. Three groups of samples were prepared for the amplicon NGS analysis: The first group comprised 11 samples that were not treated with the restriction enzyme. The second group comprised 11 samples that were first pooled and then simultaneously treated with the restriction enzyme. The third group comprised 11 samples that were treated individually with the restriction enzyme and thereafter pooled. There were two plasmid concentrations: 20 ng/µL and 2 ng/µL.
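The combination of treatment groups and plasmid concentrations described above can be enumerated as a small design table. The following Python sketch is only an illustration of that layout; the short group labels and the function name are assumptions introduced here, not identifiers from the study.

```python
from itertools import product

# Treatment groups and plasmid concentrations as described in the text.
# The dictionary keys are illustrative labels, not names used by the authors.
TREATMENTS = {
    "untreated": "no restriction enzyme treatment",
    "pooled_then_digested": "11 plasmids pooled, then treated with NcoI",
    "digested_then_pooled": "each plasmid treated with NcoI, then pooled",
}
CONCENTRATIONS_NG_PER_UL = (20.0, 2.0)


def build_design():
    """Return one record per library: every treatment at every concentration."""
    return [
        {"treatment": treatment, "conc_ng_per_ul": conc}
        for treatment, conc in product(TREATMENTS, CONCENTRATIONS_NG_PER_UL)
    ]


if __name__ == "__main__":
    for library in build_design():
        print(library)
```

Under these assumptions the design contains six pooled libraries, one per treatment and concentration.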

Illumina sequencing for eukaryotic metabarcoding

The plasmids of the 11 parasite species were amplified using primers targeting the 18S rRNA V9 region, with NGS adaptors attached to the primers: 1391F (5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG GTACACACCGCCCGTC-3′) and EukBR (5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG TGATCCTTCTGCAGGTTCACCTAC-3′)33. We chose the V9 region of the 18S rDNA because of its potential to capture a broad range of eukaryotes efficiently on the Illumina sequencing platform39,40. The master mix, KAPA HiFi HotStart ReadyMix (Roche Sequencing Solutions, Pleasanton, CA, USA), contained the primers and 3 µL of pooled total DNA, prepared by diluting each of the 11 plasmid DNAs to the same concentration, namely the lowest measured plasmid concentration of 20 ng/µL. PCR amplification was then performed using an Applied Biosystems Veriti 96-Well Fast Thermal Cycler (Thermo Scientific™, Waltham, MA, USA) as follows: 95 °C for 5 min; 30 cycles of 98 °C for 30 s, 55 °C for 30 s, and 72 °C for 30 s; and a final extension at 72 °C for 5 min. A limited-cycle (8-cycle) amplification step was then performed to add multiplexing indices and Illumina sequencing adapters. Thereafter, the amplicons were pooled and sequenced on an Illumina iSeq 100 sequencing system using the Illumina iSeq™ 100 i1 Reagent v2 kit (Illumina Inc., San Diego, CA, USA) according to the manufacturer’s protocol. In addition, to evaluate the effect of the annealing temperature used during amplicon PCR on the NGS output, annealing temperatures ranging from 40 to 70 °C, in 3 °C increments, were tested for library preparation.
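The primer layout and cycling conditions described above can be written down compactly. The Python sketch below assembles the adaptor-tailed primers from the sequences given in the text, lists the annealing temperatures of the gradient, and sums the programmed hold times of the amplicon PCR. The function names are illustrative, and the time estimate ignores ramping between steps, which is an assumption of this sketch.

```python
# Adaptor overhangs and locus-specific primers, copied from the text.
FWD_OVERHANG = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REV_OVERHANG = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"
PRIMER_1391F = "GTACACACCGCCCGTC"
PRIMER_EUKBR = "TGATCCTTCTGCAGGTTCACCTAC"


def tailed_primer(overhang, locus_specific):
    """Join an adaptor overhang (5' side) to a locus-specific primer (3' side)."""
    return overhang + locus_specific


def annealing_gradient(start=40, stop=70, step=3):
    """Annealing temperatures in degrees Celsius, inclusive of both ends."""
    return list(range(start, stop + 1, step))


def hold_time_seconds(cycles=30, step_s=30, initial_s=300, final_s=300):
    """Sum of programmed hold times for a three-step cycle; ramping is ignored."""
    return initial_s + cycles * 3 * step_s + final_s


if __name__ == "__main__":
    print(tailed_primer(FWD_OVERHANG, PRIMER_1391F))
    print(tailed_primer(REV_OVERHANG, PRIMER_EUKBR))
    print(annealing_gradient())
    print(hold_time_seconds() / 60, "min of programmed hold time")
```

With the default arguments the gradient contains 11 temperatures, one of which is the 55 °C used in the first experiment.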

Bioinformatic procedures

Quantitative Insights Into Microbial Ecology 2 (QIIME 2™, v2023.2) was used to analyze the iSeq 100 data41. Sequence reads were demultiplexed, and low-quality sequences were trimmed using Cutadapt (v4.5)42. Thereafter, the trimmed reads were denoised and dereplicated, and chimeric reads were filtered out using DADA2 (v1.26)43, a widely used denoising algorithm in 18S rDNA metabarcoding44,45,46,47. For the taxonomic assignment of the amplicon sequence variants, we used the complete set of sequences available in the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nuccore/), as it encompasses a broader range of parasite sequences than curated databases. We conducted an advanced search by gene name, specifically “18S rRNA”48, and extracted the matching sequences from the NCBI database to construct a reference database for vertebrates and parasites. Subsequently, sequences clustered at 100% identity were compared against the 18S rRNA sequences in this database to generate the classification table. The taxonomic classification of the representative sequences was performed using a feature classifier based on the consensus search method49. Unassigned reads (0.07% of the total reads) were removed from subsequent analyses.

Prediction of 18S rDNA V9 secondary structure

The DNA secondary structure of the 18S rDNA V9 region of each parasite was predicted, and the number of intra-GC pairs was counted using Vector Builder software (https://www.vectorbuilder.kr/tool/dna-secondary-structure.html) and Geneious Prime (version 2023.2.1, https://www.geneious.com).
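To make the counting step concrete, the sketch below counts G-C pairs in a predicted structure. It assumes the structure is available in dot-bracket notation, which is a common export format but is not stated in the text, and the example sequence is a toy hairpin rather than a V9 sequence from this study.

```python
def count_gc_pairs(sequence, structure):
    """Count G-C base pairs in a dot-bracket secondary structure.

    '(' and ')' mark paired positions and '.' marks unpaired positions.
    Pseudoknots (other bracket types) are not supported in this sketch.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper()
    open_positions = []  # stack of indices of unmatched '('
    gc_pairs = 0
    for i, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')'")
            j = open_positions.pop()
            if {seq[i], seq[j]} == {"G", "C"}:
                gc_pairs += 1
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r}")
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return gc_pairs


if __name__ == "__main__":
    # Toy hairpin: a 4-bp stem (G-C, A-T, T-A, C-G) closing a 4-nt loop.
    print(count_gc_pairs("GATCAAAAGATC", "((((....))))"))
```

The stack pairs each closing bracket with the most recent unmatched opening bracket, which is how nested stems are resolved in dot-bracket notation.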

Statistical analysis

All statistical analyses were performed using R Statistical Software (v4.3.2; R Core Team, 2023). Data visualizations were created with the ggplot2 package (v3.5.1; Wickham, 2016). To investigate the relationship between the number of intra-GC base pairs in the hairpin structure (independent variable) and the relative abundance of each target species, expressed as a percentage (dependent variable), at an amplicon PCR annealing temperature of 55 °C, we conducted a simple linear regression analysis using the ‘lm’ function in R. The p-values and regression coefficients reported here are derived from this regression model. The significance level was set at 0.05.
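For readers who want to see what a simple linear regression of this kind computes, the following is a minimal ordinary least squares sketch in Python. It is not the authors' R code. The data points are placeholder values invented for illustration, not values from Table 1, and the p-value (which would come from a t distribution with n − 2 degrees of freedom) is deliberately left out because it needs a statistics library.

```python
import math


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the standard error of the slope.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    if se_slope > 0:
        t_slope = slope / se_slope
    else:
        t_slope = math.copysign(math.inf, slope)
    return {"slope": slope, "intercept": intercept, "t_slope": t_slope, "df": n - 2}


if __name__ == "__main__":
    # Placeholder data for illustration only (NOT taken from the study).
    gc_pairs = [5, 10, 15, 20, 25]
    abundance_pct = [20.0, 18.0, 12.0, 9.0, 3.0]
    print(simple_linear_regression(gc_pairs, abundance_pct))
```

A negative slope in this setting means that the relative abundance falls as the number of intra-GC pairs rises.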

Results

Metabarcoding of the 18S V9 region of the 11 parasites under study

After cloning the 18S rDNA V9 region (18S V9) of parasites detectable in human stool, 11 plasmids were extracted and pooled in equal amounts to prepare libraries for parasite metabarcoding. The NGS output comprised a total of 434,849 reads, and all 11 parasite species were detected, suggesting that parasite metabarcoding is an effective method for detecting intestinal parasites. Interestingly, the proportion of sequenced reads differed among the parasites despite the use of the same input amount (Fig. 2, Supplementary Table 1). In descending order, the read counts were as follows: C. sinensis, 74,893 reads (17.21%); E. histolytica, 73,045 reads (16.79%); D. latus, 62,728 reads (14.41%); T. trichiura, 47,320 reads (10.87%); F. hepatica, 37,994 reads (8.73%); N. americanus, 37,101 reads (8.53%); P. westermani, 37,017 reads (8.51%); T. saginata, 31,184 reads (7.17%); G. intestinalis, 21,895 reads (5.03%); A. lumbricoides, 7,529 reads (1.73%); and E. vermicularis, 4,143 reads (0.95%).
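The read-count shares can be recomputed directly from the counts listed above. The Python sketch below does so, assuming the shares are taken relative to the reads assigned to the 11 species only; the published percentages may differ in the last digit if a slightly different denominator (for example, one including unassigned reads) was used. The variable and function names are illustrative.

```python
# Read counts per species, copied from the text.
READ_COUNTS = {
    "C. sinensis": 74893,
    "E. histolytica": 73045,
    "D. latus": 62728,
    "T. trichiura": 47320,
    "F. hepatica": 37994,
    "N. americanus": 37101,
    "P. westermani": 37017,
    "T. saginata": 31184,
    "G. intestinalis": 21895,
    "A. lumbricoides": 7529,
    "E. vermicularis": 4143,
}


def relative_abundance(counts):
    """Return each entry's share of the total count, in percent."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total read count is zero")
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    shares = relative_abundance(READ_COUNTS)
    for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<16} {READ_COUNTS[name]:>7,d} reads  {pct:5.2f}%")
```

With equal plasmid input, an even outcome would give each species roughly 9.1% of the reads (100/11), which makes the spread in the observed shares easy to judge.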

Fig. 2

Relative abundances of the sequence reads in 11 intestinal parasite species: Ascaris lumbricoides, Clonorchis sinensis, Dibothriocephalus latus, Entamoeba histolytica, Enterobius vermicularis, Fasciola hepatica, Giardia intestinalis, Necator americanus, Paragonimus westermani, Taenia saginata, and Trichuris trichiura.

The DNA sequence of the 18S V9 region formed a specific DNA secondary structure (a hairpin structure). We therefore hypothesized that the number of GC base pairs in the hairpin structure is associated with the differences in read ratios. The hairpin structures were predicted computationally, and the number of intra-GC pairs in each was counted (Fig. 3; Table 1). Next, linear regression analysis was performed with the NGS output ratio as the dependent variable and the number of GC pairs in the hairpin as the independent variable. The greater the number of GC pairs in the hairpin of the 18S V9 region, the lower the NGS output ratio, and this relationship was statistically significant (regression coefficient = −1.0473, p = 0.00485; intercept = 28.6097, p = 0.00048, using the Vector Builder counts). Repeating the analysis with the intra-GC pair counts from a different program (Geneious Prime) produced a similar result (regression coefficient = −1.0133, p = 0.0179; intercept = 27.5076, p = 0.0022).
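The reported coefficients define a fitted line, relative abundance (%) = intercept + coefficient × number of GC pairs. As a rough illustration of what the fit implies, the Python sketch below evaluates this line for the two sets of coefficients given in the text. The GC pair counts passed in are arbitrary example inputs, not values from Table 1, and the function names are assumptions of this sketch.

```python
# (intercept, slope) pairs as reported in the text for the two programs.
FITS = {
    "Vector Builder": (28.6097, -1.0473),
    "Geneious Prime": (27.5076, -1.0133),
}


def predicted_abundance(gc_pairs, intercept, slope):
    """Relative abundance (%) predicted by the fitted line."""
    return intercept + slope * gc_pairs


def zero_crossing(intercept, slope):
    """Number of GC pairs at which the fitted line reaches 0% abundance."""
    if slope == 0:
        raise ValueError("a flat line never crosses zero")
    return -intercept / slope


if __name__ == "__main__":
    for program, (intercept, slope) in FITS.items():
        for gc_pairs in (10, 20):  # arbitrary example inputs
            pct = predicted_abundance(gc_pairs, intercept, slope)
            print(f"{program}: {gc_pairs} GC pairs -> {pct:.2f}% predicted")
        print(f"{program}: line reaches 0% near {zero_crossing(intercept, slope):.1f} GC pairs")
```

Both fits lose roughly one percentage point of relative abundance per additional GC pair. Extrapolating far outside the observed range of GC pair counts would not be meaningful.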

Table 1 Next-generation sequencing output and the number of intra-GC pairs in the hairpin structure of the 18S rDNA V9 region of 11 intestinal parasites.

Fig. 3

Secondary structure of the 18S rDNA V9 region and GC pair numbers. (a) Trichuris trichiura, (b) Fasciola hepatica, (c) Clonorchis sinensis, (d) Dibothriocephalus latus, (e) Enterobius vermicularis, (f) Necator americanus, (g) Giardia intestinalis, (h) Entamoeba histolytica, (i) Paragonimus westermani, (j) Ascaris lumbricoides, (k) Taenia saginata.

Effect of plasmid linearization on NGS output

We used the restriction enzyme to linearize the plasmid and prepared three sample groups for amplicon NGS analysis: untreated samples, samples pooled before enzyme treatment, and samples treated individually with the enzyme before pooling (Fig. 4a, Supplementary Table 2). Amplicon NGS performed on these groups revealed no differences in relative abundance between the enzyme-treated and non-treated samples. This result was consistent even in the 10-fold diluted samples (Fig. 4b).

Fig. 4

Bar plots showing the relative abundances of the plasmids of 11 species at plasmid concentrations of (a) 20 ng/µL and (b) 2 ng/µL. Bottom, no restriction enzyme treatment; middle, the 11 plasmids were pooled first and then treated with the restriction enzyme NcoI; top, each plasmid was first treated with NcoI and then pooled.

Metabarcoding at different annealing temperatures

We evaluated the effect of annealing temperature on the amplicon PCR process. In the first experiment (Figs. 2 and 3), the annealing temperature for the amplicon PCR targeting the 18S V9 region was 55 °C. Given that PCR efficiency may vary with annealing temperature, we tested annealing temperatures ranging from 40 to 70 °C in 3 °C increments for library preparation. As the temperature increased, the numbers of reads for the different parasites became more similar (Fig. 5, Supplementary Table 3). Specifically, parasites that were initially detected in lower numbers, such as A. lumbricoides, E. vermicularis, and T. saginata, showed an increasing number of reads as the temperature increased.

Fig. 5

Effect of annealing temperature on the relative abundance of sequence reads in 11 intestinal parasites.

Only F. hepatica showed a different pattern: its read number decreased with increasing annealing temperature, and its relative abundance was only 0.2% at 70 °C. To confirm this decrease, PCR was performed on each individual plasmid containing the 18S V9 region, and the amplicon concentration was measured at annealing temperatures set in 5 °C increments. The F. hepatica amplicon concentration was notably lower than that of the other parasites at the higher temperatures (65 and 70 °C) (Table 2).

Table 2 Changes in PCR amplicon total amounts (ng) on individual parasite plasmids with various annealing temperatures.

Discussion

Metabarcoding, which employs high-throughput sequencing of 18S rRNA gene amplicons, plays a crucial role in the assessment of the diversity of eukaryotes in various ecosystems39. In this study, we used metabarcoding to screen various human intestinal parasites simultaneously. Studies in this regard are limited; in particular, there are no reports identifying efficient methods for such analyses.

Using metabarcoding, we detected all 11 intestinal parasite species. DNA metabarcoding uses NGS technology as an extension of barcoding. The advent of readily available NGS technologies has transformed the fields of clinical and public health microbiology50,51,52. Beyond rapid and precise pathogen identification, the combination of high-throughput methods and bioinformatics offers new insight into disease transmission, virulence, and antimicrobial resistance.

Several studies have documented the use of NGS in detecting a wide range of parasites in humans. Targeted amplicon NGS identified 16 species of blood-borne helminths and protozoa, including Plasmodium falciparum, Leishmania infantum, Trypanosoma cruzi, and Brugia malayi53. Intestinal eukaryotic protists were detected in stool samples from healthy Tunisian individuals, including Dientamoeba fragilis, Giardia intestinalis, Cryptosporidium spp., Blastocystis sp., and Entamoeba sp.36. Furthermore, long-read nanopore sequencing technology has proven effective in accurately detecting and characterizing a diverse range of filarial worms54. NGS has also been employed to identify Pneumocystis jirovecii and other pathogens in bronchoalveolar lavage fluid from patients with lung diseases55,56.

We investigated the effects of GC base pairs, annealing temperature, and plasmid DNA linearization on the sequencing results of pooled libraries containing 11 species of intestinal parasites. The 11 species all had different relative abundances, which could be attributed to the number of GC pairs in the DNA secondary structure, with which relative abundance was negatively correlated. GC-rich regions, owing to the formation of stable and complex secondary structures within a DNA template, can block DNA polymerase during PCR and lead to ineffective amplification57. These regions also often form intricate secondary structures that can resist denaturation during PCR. Moreover, the primers used for amplifying these GC-rich regions have a propensity to self-anneal and cross-anneal, creating stem-loop structures that may hinder the progression of DNA polymerase along the template58. Hydrogen bonding plays a crucial role in the interactions within a DNA base pair, and the hydrogen bond energy between the nucleobases that form a base pair has been calculated in numerous theoretical studies59. The G-C base pair is reported to be stronger than the A-T base pair because it forms three hydrogen bonds, whereas the A-T base pair forms two60. Additionally, other factors, such as length, may also have an effect61.
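The hydrogen-bond argument can be illustrated with a simple count. The Python sketch below totals the Watson-Crick hydrogen bonds in a fully paired stem, counting three per G-C pair and two per A-T pair as stated above. The two stems are toy sequences chosen for illustration, and the count is only a crude proxy for duplex stability, since base stacking and sequence context are ignored.

```python
# Hydrogen bonds contributed by each base when it sits in a Watson-Crick pair.
H_BONDS_PER_BASE = {"G": 3, "C": 3, "A": 2, "T": 2}


def stem_hydrogen_bonds(strand):
    """Total hydrogen bonds if every base of `strand` is Watson-Crick paired.

    `strand` is one side of the stem, so each base represents one base pair.
    """
    total = 0
    for base in strand.upper():
        if base not in H_BONDS_PER_BASE:
            raise ValueError(f"unexpected base {base!r}")
        total += H_BONDS_PER_BASE[base]
    return total


if __name__ == "__main__":
    # Two toy 6-bp stems of equal length (illustrative, not from the study).
    for stem in ("GCGCGC", "ATATAT"):
        print(stem, stem_hydrogen_bonds(stem))
```

For stems of equal length, the GC-only stem carries half again as many hydrogen bonds as the AT-only stem, which is consistent with GC-rich hairpins being harder to melt.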

Increasing the annealing temperature during PCR enhances the amplification of a specific DNA sequence57. This is due to improved specificity, reduced primer-dimer formation, increased stability of primer-template interaction, and optimal Tm (melting temperature) matching, all facilitated by the principles of thermodynamics and DNA base pairing. Specifically, an escalation in temperature increases the kinetic energy of DNA strands, diminishing the possibility of nonspecific interactions and fostering the creation of robust, targeted primer-template pairings. Further, at high temperatures, non-specific DNA regions are less likely to form stable interactions with the primers due to weak binding affinity. This minimizes the potential for non-specific amplification.

The effect of GC content on Tm does not necessarily disappear at high temperatures; it remains a factor that influences primer-template interactions. High annealing temperatures can compensate to some extent for differences in GC content between primers and templates, but it is still important to consider GC content when designing primers, regardless of the annealing temperature employed.

In summary, higher annealing temperatures ensure that primers bind specifically to the target DNA region, minimizing non-specific amplification and promoting efficient and accurate DNA amplification. In this study, we observed that A. lumbricoides, E. vermicularis, G. intestinalis, and T. saginata were detected well when the temperature was raised. Therefore, if metabarcoding is performed to detect these species, a relatively high annealing temperature is recommended. For F. hepatica, in contrast, the read number was unexpectedly reduced at high temperatures, so detection of this species may become challenging at high annealing temperatures.

The results obtained notwithstanding, this study had some limitations. Similar amplification was achieved at elevated temperatures for all species except F. hepatica. Because the sequencing results for this species were limited at high annealing temperatures, we concluded that an annealing temperature of 55 °C was the most appropriate for amplicon sequencing.

One limitation of our study is the use of universal eukaryotic primers. While this approach offers broader taxonomic coverage, it can also amplify DNA from hosts and other food materials present in faecal samples. This can lead to a masking effect, where parasite DNA is obscured by the more abundant host or non-target DNA, potentially underestimating parasite burden or missing low-abundance parasites32,33,62. To address this limitation in future studies, several strategies can be considered. First, methods to decrease host DNA abundance, such as enzymatic removal or alternative depletion techniques based on previous studies53,63, can be employed. Additionally, using other target regions, such as the ITS2 region for nemabiome analysis and mitochondrial rRNA genes for helminth metabarcoding, as complementary methods to 18S rDNA amplicon sequencing64,65,66,67,68,69,70, can provide highly accurate identification of nematode and helminth parasites, although this may reduce broader parasite detection coverage, particularly for protozoa. Another limitation of our study is the lack of validation with real-world samples, such as faeces. Despite our efforts, finding suitable faecal samples with multiple parasite infections was not possible. Nevertheless, our study provides valuable insights into factors influencing read counts and NGS library preparation optimization. This paves the way for future research using more specific methods for parasite detection in real-world samples.

Conclusions

Our findings provide insights into the factors that influence NGS read count and the optimization of library preparation protocols for parasite metabarcoding. The number of GC base pairs in the secondary hairpin structure of the amplicon DNA was significantly associated with read output, suggesting that it affects amplification efficiency. Optimizing the annealing temperature for specific library preparation protocols can be proposed as a potential approach to improve the detection rate of specific parasites. Advancements in NGS technology, leading to greater accuracy and reduced costs, make the routine use of metabarcoding for helminth detection more feasible. These advancements hold promise for enhancing efforts to control and prevent intestinal parasitic infections, ultimately contributing to better public health outcomes.