Introduction

Spinal infection, defined as an infectious process involving the vertebral body, intervertebral disc, paraspinal soft tissue, or epidural space and confirmed by clinical presentation, imaging findings, and microbiological or histopathological evidence, presents significant diagnostic challenges due to its insidious onset, non-specific symptoms, and often inconclusive laboratory findings1,2,3,4,5,6. Timely and accurate identification of pathogens is essential for guiding appropriate antimicrobial therapy. Traditionally, diagnosis has relied on isolating pathogens from cultured spinal tissue or fluid samples; however, the low detection rates of conventional microbiological cultures pose substantial challenges for accurate diagnosis and targeted antimicrobial therapy7,8.

Metagenomic next-generation sequencing (mNGS) has emerged as a promising diagnostic method for infectious diseases due to its ability to detect a broad range of pathogens, including rare and unculturable organisms9,10,11,12,13. Compared to TCT, mNGS typically offers faster turnaround and higher detection rates. However, its diagnostic accuracy in spinal infections remains controversial, with studies reporting considerable variability in sensitivity and specificity.

For instance, Zhang Yi et al.14, in a prospective study involving 38 patients with suspected spinal infections, found that mNGS had a sensitivity of 0.86 and specificity of 1.00, while TCT showed a sensitivity of 0.49 and the same specificity of 1.00. In contrast, Ma Chi-yuan et al.15 reported lower diagnostic performance, with mNGS yielding a sensitivity of 0.69 and specificity of 0.27, and TCT showing even poorer sensitivity at 0.15 but a specificity of 1.00. These inconsistencies may be attributed to differences in sample sizes, specimen acquisition methods (e.g., open surgery versus percutaneous biopsy), and sequencing platforms used across studies. They highlight the need for a systematic evaluation that synthesizes current evidence and estimates the diagnostic performance of mNGS compared with TCT.

To date, no meta-analysis has consolidated the diagnostic efficacy of mNGS and TCT in spinal infections. This study aims to assess the diagnostic value of both mNGS and TCT for spinal infections by pooling data from studies that applied both testing methods.

Materials and methods

Search strategy

This study was registered in the PROSPERO database (CRD42022383002) and adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The PubMed, Embase, Web of Science, Cochrane Library, and SinoMed databases were searched up to December 2023. The literature on the comparative diagnostic accuracy of mNGS and TCT was searched using specific terms including “mNGS”, “metagenomic next-generation sequencing”, “culture”, “spinal”, “spine”, “infection”, “infective”, and “spinal infection”.

Inclusion and exclusion criteria

Inclusion Criteria: (1) Studies involving subjects with suspected spinal infections; (2) Studies comparing the diagnostic performance of mNGS and TCT; (3) Studies containing necessary data for extraction, such as counts of False Positives (FP), True Positives (TP), False Negatives (FN), and True Negatives (TN); (4) Diagnostic criteria based on histopathological tests and the Infectious Diseases Society of America (IDSA) standards16. Exclusion Criteria: (1) Studies focusing solely on mNGS or TCT; (2) Studies lacking comparative analysis of mNGS and TCT; (3) Studies with incomplete data; (4) Case reports, animal studies, limited analyses, and review comments.

Data extraction

Two independent researchers (YZY, YCL) initially screened titles and abstracts according to the inclusion and exclusion criteria, followed by full-text reviews of eligible studies. Discrepancies were resolved through discussion. Common reference standards used for diagnosing spinal infections included histopathological tests and IDSA criteria. Extracted data comprised: (1) the first author’s name, year of publication, and study type; (2) number of cases and their distribution in the diagnostic 2 × 2 (fourfold) table (FP, TP, FN, TN); (3) type of mNGS sequencing platform; (4) sampling method; (5) diagnostic reference standard for spinal infections.
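As an illustration of how the extracted fourfold table translates into per-study accuracy, the following minimal Python sketch computes sensitivity and specificity from TP, FP, FN, and TN counts. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Diagnostic 2 x 2 table for one study and one index test."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def sensitivity(self) -> float:
        # Proportion of infected patients the test correctly flags.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-infected patients the test correctly clears.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts for illustration only.
example = TwoByTwo(tp=30, fp=2, fn=10, tn=8)
print(round(example.sensitivity(), 2))  # 0.75
print(round(example.specificity(), 2))  # 0.8
```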

Quality assessment

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was used to evaluate the methodological quality of included studies across four domains: patient selection, index test, reference standard, and flow/timing. Two independent researchers performed the assessment, with discrepancies resolved through group consensus. Studies classified as high risk in any domain were retained in the primary analysis but subjected to sensitivity analysis by excluding them to assess their impact on pooled estimates. Additionally, meta-regression was performed to evaluate whether QUADAS-2 risk categories (low/unclear/high) explained heterogeneity in sensitivity or specificity. No studies were excluded solely based on QUADAS-2 scores unless critical methodological flaws (e.g., non-consecutive enrollment without justification, lack of blinding between index test and reference standard) rendered diagnostic accuracy data unreliable.

Statistical analysis

Review Manager 5.3 and Stata 16.0 were used to calculate pooled sensitivity, specificity, and 95% confidence intervals (CI) using a random-effects model to account for anticipated heterogeneity across studies. Summary Receiver Operating Characteristic (SROC) curves and the Area Under the Curve (AUC) were generated. Heterogeneity was quantified using the Q-test and I² statistic, with I² ≥ 50% indicating substantial heterogeneity. When significant heterogeneity (I² ≥ 50% or p < 0.1) was detected, threshold effect analysis, sensitivity analysis (e.g., excluding studies with high QUADAS-2 risk scores), and meta-regression were performed to identify sources of variability. All statistical tests used a significance threshold of p < 0.05.
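For readers unfamiliar with the heterogeneity statistics named above, the sketch below shows one common way to compute Cochran's Q and I² from per-study estimates. It is a simplified univariate illustration on the logit scale with inverse-variance weights, not a reproduction of the Review Manager or Stata routines used in this study. The function names and the input counts are hypothetical.

```python
import math


def logit_and_variance(events, total):
    """Logit of a proportion and its approximate variance.

    A 0.5 continuity correction is applied so that studies with
    zero cells do not produce infinite values.
    """
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b


def cochran_q_i2(estimates, variances):
    """Return (Q, I2 in percent) using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical (true positives, infected patients) per study.
studies = [(30, 40), (18, 20), (25, 45), (50, 60)]
pairs = [logit_and_variance(tp, n) for tp, n in studies]
q, i2 = cochran_q_i2([p[0] for p in pairs], [p[1] for p in pairs])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```

Under the threshold stated in this section, an I² of 50% or more from such a calculation would be read as substantial heterogeneity.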

Results

Search results and quality assessment

The article selection process followed the PRISMA guidelines and is depicted in Fig. 1. A comprehensive search across the specified databases yielded 1,406 articles. After duplicate removal, 623 articles remained. Initial screening of titles and abstracts narrowed the selection to 22 articles. Full-text review led to the exclusion of 12 articles for various reasons: six did not meet the inclusion criteria, two lacked necessary data for extraction, three were case reports, and one was a narrative review. Ultimately, 10 studies were included in this review, consisting of two prospective and eight retrospective studies, involving a total of 770 patients. The basic characteristics of the included studies are summarized in Table 1. The methodological quality of these studies was assessed using the QUADAS-2 scale, as illustrated in Figs. 2 and 3. Most studies exhibited a low risk of bias concerning reference standards, procedures, and timing. However, four studies presented an unclear risk of bias regarding consecutive inclusion.

Fig. 1 PRISMA flowchart of the study selection.

Table 1 Characteristics of the included studies.
Fig. 2 Risk of bias and applicability concerns graph.

Fig. 3 Risk of bias and applicability concerns summary.

Meta-analysis results

Efficacy of the two testing methods

mNGS demonstrated a pooled sensitivity of 0.81 (95% CI, 0.74–0.87) and specificity of 0.75 (95% CI, 0.48–0.91) (Fig. 4). In contrast, TCT showed a pooled sensitivity of 0.34 (95% CI, 0.27–0.43) and specificity of 0.93 (95% CI, 0.79–0.98) (Fig. 5). The positive likelihood ratio (PLR) for mNGS was 3.30 (95% CI, 1.30–7.90) and for TCT was 5.00 (95% CI, 1.70–14.80), while the negative likelihood ratio (NLR) was 0.25 (95% CI, 0.15–0.41) for mNGS and 0.70 (95% CI, 0.63–0.79) for TCT. In clinical diagnostic interpretation, a PLR > 10 or NLR < 0.1 is generally considered to provide strong diagnostic evidence. Neither mNGS (PLR = 3.30, NLR = 0.25) nor TCT (PLR = 5.00, NLR = 0.70) met these criteria, demonstrating that neither method alone can definitively confirm or exclude spinal infection. The wide confidence intervals for mNGS specificity (0.48–0.91) and TCT sensitivity (0.27–0.43) further emphasize substantial variability across studies. These findings underscore the necessity of a combined diagnostic strategy integrating clinical evaluation, imaging, and microbiological testing to optimize diagnostic accuracy. To further evaluate diagnostic accuracy, SROC curves were plotted, revealing an AUC of 0.85 (95% CI, 0.82–0.88) for mNGS (Fig. 6) and 0.59 (95% CI, 0.55–0.63) for TCT (Fig. 7).
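The relationship between these quantities can be checked with a short calculation. The sketch below applies the standard definitions PLR = sensitivity / (1 − specificity) and NLR = (1 − sensitivity) / specificity to the pooled point estimates reported above. The function name is illustrative. The results only approximate the reported PLR and NLR, presumably because the published values were estimated within the pooled model rather than derived from the rounded point estimates.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (PLR, NLR) from a sensitivity/specificity pair."""
    plr = sensitivity / (1.0 - specificity)
    nlr = (1.0 - sensitivity) / specificity
    return plr, nlr


# Pooled point estimates reported in this section.
pooled = {"mNGS": (0.81, 0.75), "TCT": (0.34, 0.93)}

for name, (sens, spec) in pooled.items():
    plr, nlr = likelihood_ratios(sens, spec)
    # Thresholds for strong diagnostic evidence, as stated in the text.
    strong = plr > 10 or nlr < 0.1
    print(f"{name}: PLR ~ {plr:.2f}, NLR ~ {nlr:.2f}, strong evidence: {strong}")
```

This yields approximately 3.24 and 0.25 for mNGS and 4.86 and 0.71 for TCT, close to the reported values, and neither test crosses the PLR > 10 or NLR < 0.1 thresholds.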

Fig. 4 Sensitivity and specificity of mNGS for diagnosing spinal infections.

Fig. 5 Sensitivity and specificity of TCT for diagnosing spinal infections.

Fig. 6 Summary receiver operating characteristic (SROC) curves based on mNGS.

Fig. 7 Summary receiver operating characteristic (SROC) curves based on TCT.

Heterogeneity analysis

Sensitivity analysis was conducted on the 10 included studies, assessing goodness-of-fit, bivariate normality, influence analysis, and outlier detection. The results for the TCT group are shown in Fig. 8. In the mNGS group, influence analysis and outlier detection identified the studies by Yuan Li (2022)21, Guanzhong Wang (2023)22, and Shi Shiyuan (2022)23 as potential sources of heterogeneity (Fig. 9 and Supplementary Fig. S1). The Spearman correlation coefficient for sensitivity and specificity was 0.043 (p = 0.907) in the mNGS group and 0.411 (p = 0.238) in the TCT group, indicating that heterogeneity was not due to the threshold effect in either group. Further univariable meta-regression analysis was performed to explore the sources of heterogeneity. In the mNGS group, sensitivity was affected by sample size, and specificity was influenced by the sampling method. In the TCT group, sensitivity was influenced by the sampling method, while specificity was affected by both the sample size and the diagnostic reference standard (Tables 2 and 3, Supplementary Figs. S2 and S3).
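The threshold-effect check described above rests on a rank correlation across studies. The following sketch implements Spearman's rho in plain Python, using average ranks for ties. It correlates sensitivity with 1 − specificity, which is one common convention and an assumption here, since the text does not state the exact transformation used. The per-study values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-study estimates, for illustration only.
sens = [0.86, 0.69, 0.81, 0.75, 0.90]
spec = [1.00, 0.27, 0.80, 0.60, 0.70]
false_positive_rate = [1.0 - s for s in spec]
print(round(spearman_rho(sens, false_positive_rate), 3))
```

A strong positive correlation between sensitivity and the false-positive rate would suggest a threshold effect. A coefficient near zero with a non-significant p-value, as reported for both groups, does not.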

Fig. 8 Sensitivity analysis of TCT group: Diagram of (a) Goodness-of-fit (b) Bivariate normality (c) Influence analysis (d) Outlier detection.

Fig. 9 Sensitivity analysis of mNGS group after removing 3 studies: Diagram of (a) Goodness-of-fit (b) Bivariate normality (c) Influence analysis (d) Outlier detection.

Table 2 Meta-regression analysis of the mNGS group.
Table 3 Meta-regression analysis of the TCT group.

Additionally, the Deeks' funnel plot asymmetry test showed no significant publication bias in either the mNGS group (p = 0.59) or the TCT group (p = 0.96) (Supplementary Figs. S4 and S5).

Meta-regression was also performed to assess whether the sequencing platform influenced diagnostic performance. No significant difference was observed in sensitivity (p = 0.06) or specificity (p = 0.95) between the Illumina and MGISEQ platforms.

Discussion

In recent years, mNGS has been increasingly utilized for diagnosing spinal infections, achieving promising clinical outcomes. In contrast, TCT, a classical method in the diagnosis of spinal infections, often suffers from low culture-positive rates. Consensus on the relative efficacy of these two diagnostic methods is still lacking. Consequently, this study reviewed the existing literature with the hope of filling this gap and providing a foundation for evidence-based clinical decision-making.

Most studies have indicated that mNGS offers significant clinical value in diagnosing spinal infections14,15,17. Our findings demonstrate that mNGS is more sensitive than TCT (0.81 vs. 0.34), albeit with a lower specificity (0.75 vs. 0.93). The PLR for mNGS and TCT was 3.30 and 5.00, respectively, while the NLR was 0.25 for mNGS and 0.70 for TCT. It is widely recognized that a PLR > 10 or an NLR < 0.1 can decisively confirm or exclude a diagnosis. Our findings suggest that neither mNGS nor TCT alone can definitively confirm or rule out spinal infections. However, mNGS exhibited not only higher sensitivity but also a more favorable diagnostic efficacy and a higher AUC compared to TCT, underscoring its potential as a more effective diagnostic tool in clinical settings.
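To make the practical meaning of these likelihood ratios concrete, the sketch below converts a pre-test probability into a post-test probability through the odds form of Bayes' theorem. The 50% pre-test probability is a purely hypothetical value chosen for illustration. It is not an estimate from this study, and the function name is likewise illustrative.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Update a probability with a likelihood ratio via odds."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical pre-test probability of spinal infection.
pre_test = 0.50

# Pooled likelihood ratios reported in this study: (PLR, NLR).
ratios = {"mNGS": (3.30, 0.25), "TCT": (5.00, 0.70)}

for name, (plr, nlr) in ratios.items():
    after_positive = post_test_probability(pre_test, plr)
    after_negative = post_test_probability(pre_test, nlr)
    print(f"{name}: positive -> {after_positive:.2f}, "
          f"negative -> {after_negative:.2f}")
```

Under this assumed pre-test probability, a negative mNGS result lowers the probability to about 0.20, whereas a negative TCT result leaves it at about 0.41. This is consistent with the observation that neither test alone rules out infection, and that a negative TCT in particular is weak evidence against it.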

The studies included in this analysis exhibited considerable heterogeneity. Meta-regression results indicated that small sample sizes (n < 50) and sampling methods contributed to the heterogeneity observed in the mNGS groups, while for the TCT group, the same factors plus reference standards were influential. Specifically, five studies14,15,19,22,24 had sample sizes under 50, potentially affecting the pooled results for both groups. In terms of sampling techniques, percutaneous puncture was used exclusively in three studies, while a combination of open surgery and percutaneous puncture was employed in seven. Percutaneous puncture biopsy is a critical method for obtaining specimens in spinal infections25 and is noted for minimizing skin contamination at the puncture site26. Jakrapun et al.27 conducted a meta-analysis on the efficacy of image-guided percutaneous puncture vertebral biopsy in spinal infections, demonstrating that sampling methods significantly impact diagnostic outcomes. Auid-Orcid et al.28 investigated sampling methods from 31 patients with suspected spinal tuberculosis. They found that the sensitivity for diagnosing tuberculosis was significantly higher with open surgical biopsy compared to percutaneous Xpert testing. Similarly, the specificity also showed a significant difference, being higher in open surgical biopsy versus percutaneous Xpert testing. Jin et al.29 reported similar findings, where the sensitivity of Xpert for diagnosing tuberculosis varied between open surgery (0.84) and percutaneous puncture (0.77). Our findings indicate that the specificity of mNGS and the sensitivity of TCT could be influenced by the sampling methods used. Additionally, this study’s results suggest that the specificity of TCT is affected by the reference standards employed, which may contribute to TCT’s lower sensitivity and relatively poor detection capability.

Different detection platforms have varying criteria for microbial positivity, which can impact test outcomes30,31. Zhang et al.32 utilized both Illumina and Nanopore mNGS platforms to assess their ability to detect microorganisms in alveolar lavage fluid. Their findings indicated that the Nanopore platform was superior in detecting Mycobacterium tuberculosis and fungi. Li Yulian et al.33 conducted a meta-analysis on the use of mNGS to detect Mycobacterium tuberculosis in the lungs, highlighting that variations in detection platforms could contribute to the heterogeneity in sensitivity observed. The mNGS platforms discussed in this paper include Illumina NextSeq and MGISEQ. The Illumina NextSeq platform is noted for its higher gene coverage, while the MGISEQ platform boasts greater sensitivity34. Nevertheless, our analysis found that the choice of detection platform did not account for the heterogeneity observed within the mNGS group. Meta-regression analysis revealed no statistically significant difference in diagnostic performance between the Illumina and MGISEQ platforms (sensitivity: p = 0.06; specificity: p = 0.95). However, the small number of Illumina-based studies (n = 4) may have limited the statistical power to detect subtle differences. These findings are consistent with a previous meta-analysis by Sike et al.35, which evaluated mNGS for diagnosing cerebrospinal fluid infections in pediatric patients and found that diagnostic performance did not significantly vary across Illumina and BGISEQ platforms.

Although meta-regression was employed to investigate the sources of heterogeneity in each group, it could not fully account for the observed variability. This study has several limitations potentially linked to various factors: (1) Antibiotic exposure prior to sampling could influence outcomes, as pre-sampling antibiotic use or brief discontinuation might eradicate microorganisms in tissues, thus affecting detection rates in both groups7,36,37. Husseini et al.38 recommend suspending antibiotics two weeks before a puncture biopsy to optimize microbial yield; (2) The results might also be influenced by the location and timing of the sample collection. Anderson et al.39 found that pus from punctures yields a higher positivity rate than bone or disc tissues, and biopsies from the upper thoracic spine show higher positivity rates. Alessandro et al.40 investigated factors affecting tissue cultures and found no significant differences related to the volume of tissue extracted or the biopsy area (bone tissue vs. intervertebral discs); however, sampling during the acute phase notably increased microbial detection rates; (3) mNGS can theoretically detect all potential pathogens in a sample by extracting nucleic acids41,42,43, but the sequencing platform’s dataset length might impact the accuracy of detection and the estimation of abundance44. Interpretation of mNGS results could be skewed by the presence of underlying microbiota, sample contamination, and host nucleic acids45. Four studies in our analysis reported possible contamination-related false positives, accounting for approximately 5.6% of mNGS-positive samples (15 out of 266), primarily involving low-abundance or environmental organisms. Regrettably, most of the 10 studies included did not fully report such data, preventing a thorough analysis of the relevant heterogeneity. This reinforces the importance of interpreting sequencing results in conjunction with clinical findings, conventional microbiology, and imaging; (4) Due to the high sensitivity of mNGS, there is an inherent risk of false-positive results stemming from sample contamination, environmental microorganisms, or colonizing species, particularly in specimens with low microbial burden. This limitation underscores the need for cautious interpretation of mNGS results and highlights the importance of integrating molecular findings with clinical judgment, imaging studies, and conventional microbiology to enhance diagnostic accuracy; (5) In addition to methodological concerns, practical barriers such as cost and accessibility limit the clinical applicability of mNGS. The test is expensive, requires advanced sequencing platforms, and depends on trained personnel and bioinformatics infrastructure, resources that may not be readily available in many healthcare settings. These economic and logistical constraints should be considered when evaluating the feasibility of widespread implementation; (6) Although the study included dual-test data from the same patient cohorts, we did not apply a contrast-based or arm-based model due to limitations in cross-tabulated reporting and the complementary nature of our research objective. Future studies may explore these models in head-to-head comparisons with standardized paired data; (7) Due to inconsistent or missing pathogen-specific data across the included studies, we were unable to assess whether diagnostic performance varied by microorganism type. Future studies incorporating standardized microbiological classifications may help clarify test-specific strengths for different spinal pathogens.

A key strength of this study is that, to the best of our knowledge, it represents the first meta-analysis to systematically and directly compare the diagnostic performance of mNGS and TCT specifically in spinal infections. This study offers a distinct contribution to the current body of literature. By synthesizing data from ten studies across varied patient populations, sampling methods, and sequencing platforms, this analysis provides robust evidence to guide clinical practice.

Conclusions

This study represents the first comparative meta-analysis of the diagnostic performance of mNGS versus TCT in spinal infections. The results demonstrate that mNGS provides higher sensitivity and overall diagnostic accuracy compared to TCT. However, its lower specificity indicates a risk of false positives, suggesting that mNGS may be more effective when used in conjunction with traditional diagnostic methods such as TCT to improve diagnostic confidence.