Interpretable molecular decision-making with DNA-based scalable and memory-efficient tree computation

Liu, Junlan; Tang, Qian; Han, Yongqi; Song, Jinxing; Wang, Fei; Guo, Pei; Fan, Chunhai; Tan, Weihong; Han, Da

doi:10.1038/s41467-025-66610-1


Article
Open access
Published: 21 November 2025


Nature Communications volume 16, Article number: 10311 (2025)



Abstract

DNA computing has emerged as a transformative paradigm for tackling computational problems at the molecular level, yet existing approaches remain constrained in algorithmic interpretability, efficiency, and scalability. Here we present a DNA-based decision tree system that modularly embeds classification rules into DNA strand displacement reaction cascades for interpretable decision-making across various configurations. It supports cascaded networks exceeding 10 layers, parallel computation of 13 decision trees in a Random Forest involving 333 strands, and multimode operation (linear/nonlinear, binary/multi-class, single/tandem trees), while maintaining low leakage, rapid signal propagation, and minimal computational elements. Coupled with a DNA-methylation sensing module, it translates biomarker profiles into molecular instructions for tree traversal, reproduces in-silico predictions and enables accurate disease subtype classification. The decision tree system represents an interpretable, scalable, and memory-efficient DNA computing approach and will open new avenues for programming intelligent molecular machines with broad applicability.


Introduction

With vast amounts of data being created in the digital age, there is a growing demand for computing paradigms that can process information with high efficiency and scalability. DNA-based molecular computing, owing to its intrinsic biocompatibility and programmability, has emerged not merely as an alternative to traditional silicon-based computation but also as a powerful biocomputing tool for analyzing multiple biomarkers directly within biological samples^1,2,3,4. With its unique advantages, including massive parallelism, precise programmability, and high-density data storage capacity⁵, the DNA-based computing paradigm has witnessed rapid progress across both enzyme-free and enzymatic systems, particularly in terms of scalability^6,7, computational performance and efficiency^8,9,10, and universality¹¹. Notably, recent advancements have enabled DNA computing systems to execute machine learning tasks such as artificial neural networks^6,10,12, convolutional neural networks⁷, support vector machines², and perceptrons¹⁰. These systems have successfully performed pattern recognition tasks like handwritten digit classification and nonlinear decision-making with sharp decision margins^10,13. However, these connectionist models typically operate as computational “black boxes” with limited interpretability in their decision-making process^14,15. This inherent opacity limits their practical use, especially in applications that demand high interpretability, such as medical diagnosis^16,17. By contrast, decision tree-based algorithms provide interpretable decision-making through explicit IF–THEN rule statements and traceable decision paths¹⁸. These features make decision trees particularly attractive for applications that demand explainable decision-making, such as medical diagnosis^16,19.

However, the decision tree algorithm remains underexplored in the DNA-based computing paradigm. Previous research has mimicked the operations of decision trees through either enzyme-assisted or enzyme-free logic gates^10,11, where rules of a decision tree are translated into logic gate-based computing frameworks, but such indirect mapping obscures interpretability and compromises scalability. Enzymatic systems, while well-suited for multi-layer systems and offering many advantages such as high sensitivity and rapid kinetics, require stricter reaction conditions and costly reagents²⁰. Thus, we chose an enzyme-free approach for its cost-effectiveness and operational compatibility. However, implementing decision trees with enzyme-free computing paradigms may introduce leakage and signal attenuation²¹. These issues make circuit composition delicate and depth-constrained, i.e., the maximum depth achieved by a single DNA computing system has been limited to 6 layers thus far^3,4,11,12,22,23,24,25,26. Furthermore, most DNA computing devices that emulate the Boolean paradigm of electronics suffer from substantial memory costs, with even basic operations demanding large numbers of distinct species. For instance, implementing a 2-bit comparator using a dual-rail logic circuit involves more than 300 distinct DNA strands¹¹. Therefore, there remains a pressing need for a DNA computing system that can directly implement decision tree computation with a scalable and memory-efficient framework.

In this work, we develop an enzyme-free DNA computing system that implements scalable and memory-efficient tree computation for interpretable decision-making. In our design, each decision node is encoded by a set of modular DNA molecules. The decision-making, executed via tree traversal, is facilitated by a leak-resistant, entropy-driven strand displacement cascade. The modularized encoding strategy allows decision nodes to be flexibly combined, supporting scalable construction of binary trees, multi-branching hierarchical structures, and a Random Forest (RF) classifier composed of multiple decision trees. This system exhibits high scalability in both breadth and depth, as demonstrated by the execution of up to 13 parallel decision trees comprising 333 unique DNA strands, and tree depths of over 10 computational layers, with minimal leakage (< 20%) and propagation times under 60 min. This DNA-based tree computing system not only overcomes major barriers in memory usage, depth scalability, and interpretability, but also pioneers a new molecular computing paradigm for explainable, programmable, and biologically relevant decision-making.

Results

Overview of the DNA-based decision tree computation

Instead of establishing implicit mathematical mappings from inputs to outputs, decision tree implements an explicit computational scheme where decision-making rules are embedded within the traversing paths connected by edges and nodes. Decision-making is realized by traversing along these paths, with each unique path leading to a different outcome (e.g., go to the movies or not) (Fig. 1a). At the molecular level, each node is encoded by a set of DNA duplexes with four distinct domains (Domain 1: parent node; Domain 2: current node; Domain 3: edge identifier; Domain 4: child node) (Fig. 1b). For instance, the encoding molecule of node C is composed of four distinct sequence domains, each following a uniform design principle: Domain 1 is derived from the preceding parent node (i.e., a2), Domain 2 signifies the current node itself (i.e., C), Domain 3 encodes the intervening edge (i.e., c1), and Domain 4 denotes the subsequent child node (i.e., D). This design ensures a coherent and structured approach to the molecular construction of a decision tree. Using a dual‑rail DNA logic circuit to implement these decision-making rules, however, requires 18 logic gates that encompass a total of 48 DNA molecules in a general-purpose DNA computing scheme¹¹, which is 8-fold higher than that required for the decision tree computing scheme designed in this work (Fig. 1c). The traversing instruction is encoded in the ssDNA input crafted with two sequence domains: one domain encodes the current node (i.e., A), and the other domain encodes the connecting edge (i.e., a2) (Fig. 1d). Each node operates through three successive states: (i) untraversed, where its input-binding toehold is blocked; (ii) activated, where an activator from the parent node displaces the blocker and exposes this toehold; and (iii) traversed, where the node-specific input strand binds and displaces the output strand. 
This displacement reaction releases a child-node activator that propagates the signal forward, and a recycled parent activator that can activate additional copies of the same node-encoding molecule. The underlying mechanism is an entropy-driven strand displacement reaction cascade (Fig. 2a), in which the release of two products from one reactant increases system entropy and thus renders the process thermodynamically favorable. Such a dual-product release enables rapid turnover of activators and robust signal relay (Fig. 2b).
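The four-domain encoding and the three node states described above can be illustrated with a small abstract model. The following is a minimal sketch, not the authors' implementation: it ignores kinetics, concentrations and leakage entirely, and treats each node-encoding duplex as a record of (parent edge, current node, edge, child). The names `NodeDuplex` and `traverse`, the extra nodes B and E, and the edges a1 and c2 are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeDuplex:
    parent_edge: str  # Domain 1: edge from the parent whose activator unblocks this duplex ("" at the root)
    node: str         # Domain 2: the current node
    edge: str         # Domain 3: the outgoing edge
    child: str        # Domain 4: the child node reached through that edge

def traverse(duplexes, inputs, root):
    """Follow input strands (node -> chosen edge) from the root.

    A duplex fires only if (i) the activator released by its parent matches
    Domain 1 and (ii) an input strand for (node, edge) is present.
    """
    path = [root]
    current, activator = root, ""  # assumption: the root needs no parent activator
    while True:
        edge = inputs.get(current)
        fired = next(
            (d for d in duplexes
             if d.node == current and d.parent_edge == activator and d.edge == edge),
            None,
        )
        if fired is None:          # untraversed: no matching activator/input pair
            return path
        path.append(fired.child)   # traversed: child activator is released
        current, activator = fired.child, fired.edge

# Toy tree: A -a1-> B, A -a2-> C, C -c1-> D, C -c2-> E (B, E, a1, c2 are hypothetical)
tree = [
    NodeDuplex("", "A", "a1", "B"),
    NodeDuplex("", "A", "a2", "C"),
    NodeDuplex("a2", "C", "c1", "D"),
    NodeDuplex("a2", "C", "c2", "E"),
]
print(traverse(tree, {"A": "a2", "C": "c1"}, "A"))
```

In this abstraction, an input strand for node C has no effect unless node A has already been traversed along edge a2, which mirrors the blocked-toehold (untraversed) state in the text.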

**Fig. 1: Framework of the DNA-encoded computational decision tree.**

**Fig. 2: Traversing a 10-layer decision tree.**

To enable multi-hop traversals with minimal leakage and propagation delays, we explored various design strategies. The ‘clamp’ domains contributed modest improvements to leakage mitigation (from 0.08× to 0.03×) (Fig. S1), and long‑range signal transmission remained hindered by persistent leakage, particularly when input was present at high concentrations (Fig. S2). To effectively suppress leakage, we engineered a toehold-extended filter for each node. The filter preferentially annihilates low-concentration spurious activators, while allowing high-concentration specific activators to propagate unimpeded (Fig. S3a). Based on the well-characterized toehold-mediated strand displacement (TMSD) reaction kinetics²⁷, we set its toehold length to 8 nt, a regime where the reaction rate reaches its plateau, and optimized the stoichiometric ratio of filter to node-encoding duplex. As shown in Fig. S3b, increasing filter concentration improves leakage suppression, but excessive filters can retard intended signal propagation. We therefore selected a filter-to-node duplex ratio of 1:5, which provided effective leakage suppression without diminishing the desired signal. The entropic increase provides the driving force for signal propagation, while leakage filters selectively remove low-level noise to ensure the high-fidelity transmission of correct computational results. Their integration enabled the construction of DNA decision trees capable of multi-hop traversal, effectively extending network depth well beyond the typical 6-layer limitation of enzyme-free systems. While the cascaded network depth may vary with application requirements, the ability to sustain such long-range cascades provides broad utility for implementing complex decision-making rules.
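The thresholding role of the filter can be pictured with an idealized stoichiometric model. This is a rough sketch under strong assumptions, not a kinetic simulation: the filter is assumed to consume activator irreversibly and one-to-one, so only activator in excess of the filter amount survives. The function name `surviving_activator` and the example activator levels are illustrative; only the 1:5 filter-to-node ratio comes from the text.

```python
def surviving_activator(activator, node_duplex=1.0, filter_ratio=1 / 5):
    """Idealized threshold: activator left after stoichiometric annihilation.

    All amounts are in units of the node-duplex concentration (1x).
    filter_ratio = 1/5 corresponds to the filter-to-node ratio chosen in the text.
    """
    filter_amount = filter_ratio * node_duplex
    return max(0.0, activator - filter_amount)

# Illustrative inputs (assumed values): a small leak versus a full-strength activator
print(surviving_activator(0.03))  # leak below the filter amount is fully absorbed
print(surviving_activator(1.0))   # specific signal passes, minus the filter amount
```

Because the surviving activator is recycled by the entropy-driven mechanism described earlier, a sub-stoichiometric loss at this step need not reduce the final output, which this simple model does not capture.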

The performance of a 10-layer DNA-encoded decision tree is noteworthy (Fig. 2c). Compared to prior enzyme-free DNA computing systems, which were typically limited to fewer than 6 layers and exhibited signal decay in deeper cascades^11,12,22,23, the DNA-encoded tree computing system demonstrated robust long-range signal transmission across 10 layers (Fig. 2d). Leakage remained below 20%, while intended output yields were maintained at similar levels across all computational layers, and the ON–OFF contrast remained stable in all layers (Fig. 2e), indicating that supplying excess input strands sustains multi-layer traversal without compromising signal fidelity. Crucially, the half-completion time for computing 10 layers was within ~60 min and showed a minimal increase from 6 to 10 layers (Fig. 2e). In contrast, previous enzyme-free DNA systems required much greater time costs, taking more than 1 hour to complete 5 layers (Fig. S4), and even enzymatic DNA systems typically plateau at 5 layers despite their faster kinetics^8,9,10. Combining deep cascades with leakage-resistant propagation and hour-scale computational speed, this system is well-positioned to support multi-layered decision tree computation.

Embedding logic rules within a single decision tree

To incorporate decision-making rules into molecular decision trees, we constructed a simple binary tree and examined its decision-making performance (Fig. S5a). All input instances reached the desired decisions by traversing the correct paths (Fig. S5b). To handle more complicated tasks, we engineered a multibranched tree with rich lateral connectivity between nodes in the same layer, as well as top-down connections (Fig. S6a). Its decision-making effectiveness was tested, and correct responses were still observed, with low-level leaky signals across all test cases (Fig. S6b). Again, this demonstrates the ability of the molecular decision tree to accurately process input information and generate a correct classification result.

Integrating multiple decision trees for building a Random Forest algorithm

To further showcase scalability and parallelism, the tree computing system was used to implement the RF algorithm, a machine learning model that aggregates ensemble predictions from multiple independent decision trees via majority voting²⁸. Each tree acts as an independent voter to evaluate its own decision splits, and the input instance is ultimately classified into the majority class based on collective votes from all trees (Figs. 3a and S7). To ensure the independent operation of every decision tree within the ensemble, we devised a Unique Toehold-Mediated Specific Node Activation (UTMSNA) strategy (see Supplementary Note 1 for details). In UTMSNA, the sequence of the initiation toehold in a node-encoding duplex is exclusively determined by its parent node. Therefore, only parent node-derived activators are fully complementary to the initiation toehold, while activators from other nodes contain mismatches. Experimental measurements confirmed that even a single mismatch at the toehold or branch migration domain markedly reduced activation efficiency, even under a 20-fold excess of mismatched activators (Fig. S8). This topology-specific activation permits decision nodes that represent the same entity to be placed at various locations within the same decision tree (Fig. S9) or even across distinct trees. This reaction orthogonality lays the foundation for implementing RF, where each tree independently processes input features. To aggregate each tree’s output, we employed a DNA-based winner-take-all (WTA) network⁶, which relies on cooperative hybridization reactions to selectively amplify the signal corresponding to the most frequent output, thereby selecting the most frequent classification result by majority voting.
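The logical operation that the WTA network performs on the tree outputs is a majority vote. The snippet below is a minimal in-silico sketch of that aggregation step only; it does not model cooperative hybridization, and the function name `majority_vote` and the class labels are placeholders.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent class among the tree outputs.

    Raises ValueError on a tie, which an odd number of trees rules out
    for a two-class problem.
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between classes; no unique majority")
    return ranked[0][0]

# 13 independent trees voting on a two-class problem (labels are placeholders)
votes = ["class_1"] * 7 + ["class_0"] * 6
print(majority_vote(votes))
```

With 13 voters and two classes the counts can never be equal, whereas an even-sized forest can split evenly and leave the outcome undefined.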

**Fig. 3: Exploring the parallelism of multiple decision trees.**

To validate our design, we trained an example RF model using a methylation chip dataset and detected the lowest out-of-bag error in the case of 12 trees in silico (Fig. S10a). To avoid ties in the voting process, we obtained an RF with 13 decision trees (Fig. S11), which also gave desirable performance (Fig. S10b). Next, we implemented the in-silico trained RF with DNA-encoded decision trees. Specifically, six CpG features related to the prognostication of lung carcinoma in situ were selected as binary node splitters and phrased in terms of methylation levels. Notably, the input species of each feature were consistent across the entire forest, with each node being activated only by its unique parent on the same tree. The computing performance of the DNA-based RF was first evaluated with a subset of 5 trees and synthetic DNA input combinations that represent these CpG features. The steady-state outputs of every single tree were approximately 1×, and integrating every three trees still led to the intended predictions (Figs. 3b and S12). The full RF also generated correct outputs based on the collective votes of all its trees, which determine the majority class (Figs. 3c and S13). Overall, these observations suggest that DNA-based RF can integrate ensemble predictions into a final decision based on majority voting. As a large-scale parallel computing tree, the molecular forest consists of 46 nodes and 96 edges, in total encompassing 126 double-stranded computing elements and 333 participant oligonucleotides in a single computing system. This scalability exceeds most previously reported DNA computing systems (Table S1).

Embedding diagnostic rules within DNA-encoded decision trees

After establishing and validating the computational performance of DNA-encoded decision trees and RF, we sought to integrate clinically derived diagnostic rules into a DNA-encoded decision tree (Fig. 4a). First, we constructed a binary decision tree (Fig. 4b) for diagnosing different subtypes of hepatitis B virus (HBV) infections²⁹. Each node represents a viral genetic marker or antibody-related biomarker that can be used to classify HBV infection subtypes. The performance of this subtyping decision tree was evaluated using synthetic DNA inputs. We observed that all combinations could accurately traverse the decision tree to generate correct classification results with minimal signal overlap from other channels (Figs. 4c, d and S14). Notably, input combinations with many identical attribute values (e.g., #7 and #9) still gave correct outputs, suggesting that UTMSNA enables topologically distinct nodes to be selectively activated, even though they symbolize the same entity.

**Fig. 4: Embedding diagnostic rules into decision tree for disease subtyping.**

After showcasing the single-tree classifier, we next validated the system’s versatility by cascading two decision trees for the diagnosis and subclassification of diabetes³⁰ (Fig. 4e). Synthetic input combinations representing various disease profiles were utilized to showcase the sequential traversal over the tree cascade (Fig. S15), and the tandemly integrated tree gave the desired outputs in 100% agreement with human judgement (Figs. 4f, g and S16). Besides a single variable, it is also intriguing to symbolize a linear classifier as a decision node and construct a hybrid model. The integration of linear classifiers enables more sophisticated tasks such as multi-categorical and nonlinear classification. Nonlinear classification involves decision boundaries that go beyond simple linear separations, allowing for more complex patterns and relationships between variables to be captured. As a demonstration, several linear classifiers (Figs. S17 and S18) were built into decision nodes to collectively constitute the node splitting conditions, where every split was assessed in the form of a question involving both variables (Fig. S19a). The beginning node demarcates the first quadrant along the orthogonal line, so that each part is sequentially divided into three subparts as a function of the other nodes (Fig. S19b, c). When connected in a top-down manner, the input space could be partitioned into six subdivisions (Fig. S19d), showcasing the versatility for more intricate nonlinear and multi-classification tasks.
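How linear classifiers acting as decision nodes can carve a two-variable input space into six regions may be sketched as follows. This is only a schematic illustration: the weights, biases and region numbering are hypothetical and are not the classifiers of Figs. S17 to S19. A root classifier splits the plane into two parts, and two further classifiers divide each part into three bands, giving 2 × 3 = 6 subdivisions.

```python
def linear_node(w1, w2, b):
    """A decision node that asks whether w1*x + w2*y + b > 0."""
    return lambda x, y: w1 * x + w2 * y + b > 0

# Hypothetical classifiers (illustrative weights only)
root = linear_node(1, -1, 0)       # is x > y ?
lower_cut = linear_node(1, 1, -1)  # is x + y > 1 ?
upper_cut = linear_node(1, 1, -2)  # is x + y > 2 ?

def classify(x, y):
    """Top-down traversal: the root picks a side, then two cuts pick one of three bands."""
    side = 0 if root(x, y) else 1
    if not lower_cut(x, y):
        band = 0
    elif not upper_cut(x, y):
        band = 1
    else:
        band = 2
    return side * 3 + band  # region index 0..5

print(classify(0.6, 0.2), classify(0.2, 1.0), classify(2.0, 1.0))
```

Each individual split is linear, yet the regions produced by chaining the splits are bounded by several lines at once, which is what makes the overall classifier nonlinear and multi-class.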

DNA-encoded decision tree for thymoma subtyping based on methylation biomarkers

While synthetic inputs can initiate tree traversal by emulating biomarker information, the true potential of the DNA-encoded decision tree lies in its ability to integrate a DNA-based biomarker-sensing module, thereby enabling diagnostic classification tasks based on actual biomarker information (Fig. 5a). To move beyond the use of artificial inputs, we selected 5-cytosine DNA methylation, the most common form of DNA methylation³¹, as the testbed for autonomous sensing and decision-making in thymoma subtyping³². Using the thermostable 9°N DNA ligase, we engineered molecular inversion probes (MIPs) to recognize CpG loci of interest by site-specific intramolecular hybridization and ligase-mediated probe circularization reaction (LPCR) (Fig. S20a). Exonucleases then digest leftover linear probes, leaving only circular probes for downstream processing. In this way, the methylation level of each CpG locus was transformed into a mixed population of methylation- and unmethylation-specific circularized MIPs at a ratio determined by the proportion of methylated targets. LPCR performance was evaluated by quantifying circular product yields across serial concentrations of target template (Figs. 5b and S20b–e), verifying ligation specificity with cross-tested methylation- and unmethylation-specific MIPs (Fig. S21), and assessing cross-reactivity by cross-validating five CpG-targeting MIPs against target and non-target bisulfite-converted synthetic templates (Fig. 5c). These results confirmed both high sensitivity and specificity of the methylation-sensing module. Additionally, an analog-to-digital converter was designed to transform circularized MIPs into ssDNA indicators of different methylation stages (Fig. S22). The converter could reliably transform either linear or circularized MIPs into designated ssDNA indicators (Figs. S22 and S23). 
Quantitative evaluation across the full β-value range (0–1.0) demonstrated robust category discrimination when Δβ was at least 0.05 away from the classification thresholds, whereas resolution near the boundary regions (Δβ ≤ 0.05) remained a limitation of the analog-to-digital converter (Fig. 5d).
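The digitization step and its boundary limitation can be expressed compactly in silico. The sketch below assumes a set of hypothetical classification thresholds (0.3 and 0.7 are placeholders, not values from this work) and flags a β-value as unreliable when it lies within the 0.05 margin of any threshold, in line with the resolution limit reported above. The function name `digitize_beta` is illustrative.

```python
from bisect import bisect_right

def digitize_beta(beta, thresholds=(0.3, 0.7), margin=0.05):
    """Map a methylation beta-value (0 to 1) to a category index.

    Returns (category, reliable). `thresholds` are hypothetical placeholders;
    `reliable` is False when beta is closer than `margin` to any threshold.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    category = bisect_right(thresholds, beta)  # number of thresholds at or below beta
    distance = min(abs(beta - t) for t in thresholds)
    return category, distance >= margin

print(digitize_beta(0.10))  # well inside the lowest category
print(digitize_beta(0.32))  # close to a threshold, flagged as unreliable
print(digitize_beta(0.90))  # well inside the highest category
```

Samples flagged as unreliable would be the ones for which the molecular converter is expected to give ambiguous indicator strands.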

**Fig. 5: DNA-encoded decision tree integrated with a sensing module.**

Next, we trained a thymoma-subtyping decision tree (Fig. 5e), and verified that a DNA-encoded version of this decision tree produces the desired kinetic behavior using circularized MIPs as input (Fig. 5f). Its performance was further evaluated in 17 test samples, where synthetic bisulfite-converted DNA sequences were mixed at predefined β-values across five CpG loci to reconstruct the methylation matrices used to train the decision tree classifier. After LPCR-based sensing and exonuclease-based clean-up, their products after analog-to-digital conversion were used to initiate traversal on the DNA-encoded decision tree. Output analysis showed 100% concordance (17/17 agreement) between molecular computation and the in-silico model’s predictions (Fig. 5g). By integrating methylation sensing, analog-to-digital conversion, and decision-tree traversal, methylation profiles can be directly processed into classification outcomes without relying on synthetic traversing instructions, indicating that the DNA-encoded decision tree can interface with biomarker-sensing modules, extending the scope of the tree computing system to diverse biomedical applications.

Discussion

By representing decision-making rules symbolically and implementing them in programmable, modular, and scalable DNA-based decision trees, we achieved diverse decision-making tasks in an interpretable manner. The modularity of DNA-encoded decision tree reduces memory cost, eliminating the extensive rewiring required in conventional dual-rail Boolean circuits when decision logics are updated. For instance, tree structures can be readily modified by replacing sequence domains or supplementing additional node duplexes to implement updated rules (Fig. S24a, b), while Boolean logic circuits entail not only a lot more computational elements but also substantial circuit rewiring for a minor modification (Fig. S24c, d). Crucially, DNA-encoded decision trees are computationally more compact. As the number of nodes and edges increases, Boolean computing paradigms exhibit a multifold growth in circuit complexity: 7.7-fold and 4.1-fold more logic gates per additional node and edge, respectively (Figs. S25–S26). By contrast, the decision tree computing system achieves equivalent functionality with threefold fewer DNA elements (Fig. S25), effectively eliminating redundant circuit components.

To facilitate the design of a DNA-based decision tree, we developed a Java-based automated sequence generator to produce the sequences of encoding molecules required for experimental demonstration of a user-defined tree (Supplementary Note 2). While computationally compact, DNA-encoded decision trees exhibit remarkable scalability in terms of participant DNA computing components, with 126 double-stranded and 333 single-stranded species in a single diffusion-based nonenzymatic DNA computing system. Although we experimentally demonstrated cascades up to 10 layers, the observed trend of leakage accumulation indicates that the system could, in principle, support additional layers before reaching the OFF-state threshold.

While this work focuses on an enzyme-free DNA computing framework, it is also intriguing to compare it with enzymatic systems, which offer notable advantages, particularly in error correction, signal tunability, and nonlinear classification¹⁰. In enzymatic systems, signal strands are synthesized by polymerases and promptly degraded by exonucleases, creating a dynamic production–degradation loop that acts as an intrinsic error correction mechanism to suppress the accumulation of spurious strands and enable deeper, more nonlinear networks with compact systems and sharp decision margins¹⁰. By contrast, enzyme-free DNA strand displacement systems preload all reactants into one pot. The persistent possibility of unintended interactions, such as fraying-induced strand displacement or off-target binding, can trigger background activation, especially in multi-layer cascades. To mitigate these issues, we incorporate multiple strategies such as leakage filters to annihilate spurious signals, UTMSNA to ensure inter-tree independence, and careful sequence design to reduce nonspecific reactions. These features enabled our system to reliably execute tree computation across more than 10 layers. Enzyme-free systems offer clear advantages such as cost-efficiency and operation under ambient conditions, but they typically require nanomolar input concentrations due to the slower kinetics of TMSD at lower reactant concentrations³³. This poses a significant challenge for practical applications such as liquid biopsies, where biomarkers in biofluids are often present at pg/mL concentrations or lower³⁴. By contrast, enzymatic systems provide higher sensitivity and can support analog amplification with programmable gain, making them particularly well-suited for ultrasensitive biosensing¹⁰. 
Nonetheless, several promising strategies have been developed to accelerate TMSD reactions without fundamentally redesigning DNA circuits, including DNA origami-based localization³⁵, specialized enzymes³⁶, and the application of commercially available cationic polymers that electrostatically recruit DNA strands to locally increase their effective concentration and reaction rate³³. These accelerative strategies provide a promising avenue to adapt enzyme-free systems for clinically relevant low-abundance biomarker detection.

The lack of the advanced reasoning capabilities required for complex tasks limits the potential applications of current machine learning models in biomedicine. As demonstrated above, we anticipate that leveraging DNA-based decision trees will offer three significant advancements to address this challenge. First, the inherent ability of tree-based models to explicitly represent logical reasoning makes them highly interpretable^18,37, as they clearly outline the biomarkers or symptoms influencing a decision, thereby fostering reliability in applications such as healthcare decision support¹⁹. Second, the integration of decision-making rules into molecular systems enables the encoding of complex algorithms into biological substrates, bridging the gap between computational logic and molecular intelligence. Third, the combination of DNA-based decision trees with linear classifiers exemplifies how tree-based DNA computational schemes can integrate different paradigms (symbolic and connectionist^38,39) into a hybrid model able to perceive and respond to environmental signals at a molecular level, paving the way for the development of intelligent biomedical platforms capable of advanced diagnostics and precision medicine. These advancements position DNA-based tree computing systems as transformative tools that merge computational precision and biological relevance and will inspire widespread applications of molecular intelligence across biological, medical, engineering, and various other fields.

Methods

Oligonucleotide synthesis

All single-stranded DNA oligonucleotides used in this study were synthesized by Sangon Biotech (Shanghai, China) or GenScript Biotech Corporation (Nanjing, China), depending on sequence length. Sequences shorter than 130 nucleotides were ordered from Sangon Biotech, whereas those exceeding 130 nucleotides were obtained from GenScript to ensure high coupling efficiency during synthesis. Unmodified DNA oligonucleotides were purified using ultra-polyacrylamide gel electrophoresis (ULTRAPAGE), and oligonucleotides containing fluorescent or quencher modifications were high-performance liquid chromatography (HPLC)–purified. All lyophilized DNA samples were dissolved in 1× TE buffer (10 mM Tris–HCl, 1 mM EDTA, pH 8.0; DNase/RNase-free; Sangon Biotech, Cat. No. B548401-0500) to a final concentration of 100 μM. Stock solutions were aliquoted and stored at 4 °C to minimize freeze–thaw cycles. The concentrations of oligonucleotides were determined by UV absorbance at 260 nm using a NanoDrop One spectrophotometer (Thermo Fisher Scientific), and molar concentrations were calculated using extinction coefficients predicted by the OligoAnalyzer Tool (Integrated DNA Technologies). All oligonucleotide sequences used in this study have been deposited in Figshare [https://doi.org/10.6084/m9.figshare.30339310]⁴⁰.

Assembling of node-encoding DNA duplexes

To assemble each node-encoding duplex, the corresponding single-stranded components (top strand, blocker, and activator) were mixed in a molar ratio of 1.1:1.5:1 in 1× TE buffer containing 12.5 mM MgCl₂ (Sigma-Aldrich, Cat. M1028). The mixture was heated to 95 °C for 5 min in a thermal cycler (Bio-Rad T100) and gradually cooled to 25 °C at a ramp rate of 1 °C per min to ensure complete hybridization. Annealed duplexes were stored at 4 °C and used directly in downstream reactions without further purification. The final duplex concentration was estimated based on the concentration of the activator strand in the mixture, as determined by UV quantification.
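The pipetting volumes implied by the 1.1:1.5:1 molar ratio follow from simple dilution arithmetic. The helper below is a sketch under stated assumptions: it uses the 100 μM stock concentration given in the oligonucleotide synthesis section, takes the target duplex concentration and final volume as free parameters (the 1 μM and 100 μL in the example are hypothetical), and lumps everything else into a single "buffer" volume without modelling how the Mg²⁺-containing buffer is prepared. The function name `annealing_volumes` is illustrative.

```python
RATIOS = {"top": 1.1, "blocker": 1.5, "activator": 1.0}  # molar ratio from the text

def annealing_volumes(duplex_uM, final_uL, stock_uM=100.0):
    """Volumes (uL) of each strand stock for one annealing mix.

    The duplex concentration is defined by the activator strand, as in the text.
    """
    volumes = {
        name: ratio * duplex_uM * final_uL / stock_uM
        for name, ratio in RATIOS.items()
    }
    volumes["buffer"] = final_uL - sum(volumes.values())
    if volumes["buffer"] < 0:
        raise ValueError("target concentration too high for the given stocks")
    return volumes

# Hypothetical example: 1 uM duplex in a 100 uL annealing reaction
for name, vol in annealing_volumes(1.0, 100.0).items():
    print(f"{name}: {vol:.2f} uL")
```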

Classifier training and validating

An epigenomic dataset from the Illumina HumanMethylation450 BeadChip (NCBI GSE108124) was obtained to train an RF classifier with the “randomForest” package in RStudio (version 2022.02.0). The training dataset consists of a β-value matrix detected in 54 patients with preinvasive lung cancer lesions, among which 18 and 36 were labeled as “Regressive” and “Progressive”, respectively. A feature elimination step based on tenfold cross-validation (five trials) was then performed to find the optimal set of features with the smallest number of biomarkers and minimum cross-validation error. From the feature importance list ordered from the highest to lowest mean decrease in accuracy, the top features were selected to train a forest, and the forest with the lowest out-of-bag estimate of the error rate after multiple tuning of the RF parameters with a custom random forest algorithm supported by the ‘caret’ training function was used as the final RF model. To train the decision tree algorithm, we obtained a DNA methylation dataset from the Infinium Methylation EPIC database (NCBI GSE218549) for 122 thymoma patients. The dataset was processed by LIMMA to select differentially methylated positions. Methylated positions with adjusted p values lower than 0.1 were filtered out, 85% of the dataset was randomly split into a training dataset, and the remaining data was used for testing. We trained a decision tree on the training dataset to distinguish between different subtypes of thymoma using the ‘rpart’ package. The performance of these classifiers was evaluated using the area under the curve calculated by the pROC package and the confusion matrix.

Sequence design of molecular inversion probes

Target-specific MIPs were designed to capture bisulfite-converted CpG loci. The Infinium MethylationEPIC annotation file (GPL21145) was used to extract 100–150 bp flanking regions of each selected CpG site. Each target sequence was in silico bisulfite-converted (converting non-CpG cytosines to thymine) using MethPrimer⁴¹. The complementary capture region of each MIP was designed using an online hybridization calculator (New England Biolabs, https://ligasecalc.neb.com/#!/ligation) to estimate the melting temperature, maintaining inter-probe variation within ±2 °C for uniform reaction conditions. Then, a non-complementary linker between the two capture arms was introduced and optimized to ensure stable hybridization while suppressing nonspecific ligation events. GC content and potential secondary structures were analyzed using NUPACK⁴² at 30 °C to minimize undesired intramolecular folding.
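Two of the design steps above lend themselves to a short sketch: the in silico bisulfite conversion and the melting-temperature uniformity check. The code below is an illustrative stand-in, not MethPrimer or the NEB calculator. The function names are hypothetical, and reading "±2 °C" as a tolerance around the mean melting temperature of the probe set is an assumption.

```python
def bisulfite_convert(seq, methylated_cpg=True):
    """Convert a top-strand sequence as bisulfite treatment would.

    Non-CpG cytosines always become thymine. Cytosines in a CpG context
    are kept when methylated_cpg is True (methylated template) and
    converted to thymine otherwise (unmethylated template).
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        if base != "C":
            converted.append(base)
            continue
        in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
        converted.append("C" if in_cpg and methylated_cpg else "T")
    return "".join(converted)


def within_tm_window(tms_c, tolerance_c=2.0):
    """Check that every probe Tm lies within +/- tolerance of the set mean.

    Using the mean as the reference point is an assumption; the Tm values
    themselves would come from an external calculator.
    """
    mean_tm = sum(tms_c) / len(tms_c)
    return all(abs(tm - mean_tm) <= tolerance_c for tm in tms_c)


if __name__ == "__main__":
    print(bisulfite_convert("ACGTCCGA", methylated_cpg=True))
    print(bisulfite_convert("ACGTCCGA", methylated_cpg=False))
    print(within_tm_window([60.0, 61.5, 58.8]))
```

Generating both the methylated and unmethylated versions of each target is what allows a pair of methylation- and unmethylation-specific probes to be designed against the same locus.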

Thermostable ligase-dependent probe circularization

Circularization reactions were carried out using 9°N™ DNA Ligase (New England Biolabs, M0238S) under the manufacturer’s recommended buffer conditions. Each 20 μL reaction contained 1× 9°N™ ligase buffer, 1.6 U μL⁻¹ 9°N™ ligase, 20 nM synthetic bisulfite-converted template, and 1000 nM each of the methylation- and unmethylation-specific MIPs. Thermal cycling was performed using a C1000 Touch Thermal Cycler (Bio-Rad) with the following program: 94 °C for 2 min, followed by 20 cycles of 94 °C for 10 s and 52 °C for 5 min. After ligation, unreacted linear probes were digested by adding 3 U Exonuclease I (NEB, M0293S) and 3 U RecJf Exonuclease (Beyotime, D6006S), followed by incubation at 37 °C for 1 h and enzyme inactivation at 85 °C for 30 min. Circularized products were used directly as inputs for downstream fluorescence reactions without purification.
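The quantities implied by this recipe can be checked with simple arithmetic. The sketch below uses only the numbers stated above; the helper names are hypothetical, and the hold-time total ignores ramping between temperatures, which depends on the instrument.

```python
def program_hold_seconds(initial_hold_s, cycles, step_holds_s):
    """Sum of programmed hold times; temperature ramping is not included."""
    return initial_hold_s + cycles * sum(step_holds_s)


def amount_fmol(conc_nm, volume_ul):
    """Amount of substance in fmol, since 1 nM x 1 µL = 1 fmol."""
    return conc_nm * volume_ul


REACTION_UL = 20  # reaction volume from the protocol

# 94 °C for 2 min, then 20 cycles of 10 s + 5 min.
hold_s = program_hold_seconds(120, 20, [10, 300])
# 1.6 U per µL in a 20 µL reaction.
ligase_units = 1.6 * REACTION_UL
template_fmol = amount_fmol(20, REACTION_UL)
mip_fmol = amount_fmol(1000, REACTION_UL)

if __name__ == "__main__":
    print(f"programmed hold time: {hold_s} s ({hold_s / 60:.1f} min)")
    print(f"ligase per reaction: {ligase_units:.0f} U")
    print(f"template: {template_fmol} fmol, each MIP: {mip_fmol} fmol")
    print(f"MIP-to-template molar excess: {mip_fmol / template_fmol:.0f}x")
```

The 50-fold molar excess of each probe over the template follows directly from the stated 1000 nM and 20 nM concentrations.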

Fluorescence kinetics experiment

Reactions were performed in 96-well black flat-bottom microplates (Corning, Cat. No. 3916) with sealed transparent lids to prevent evaporation. Unless otherwise specified, assays were carried out in 1× TE buffer supplemented with 12.5 mM Mg²⁺, maintained at 30 °C. Excitation/emission settings were: ATTO425 (Ex 435 nm/Em 475 nm), FAM (Ex 493 nm/Em 522 nm), VIC (Ex 535 nm/Em 567 nm), ROX (Ex 585 nm/Em 615 nm), AP639 (Ex 643 nm/Em 671 nm), and SF693 (Ex 672 nm/Em 699 nm). To ensure consistent baseline normalization, fluorescence signals were first recorded for 3 min prior to input addition. Subsequently, either synthetic input strands or circularized MIPs were introduced, and kinetic fluorescence data were collected every 2 min.

Data normalization

To quantify output strand concentrations from fluorescence measurements, raw fluorescence data were converted using a calibration curve established by linear regression of steady-state fluorescence intensities recorded at a series of known reporter-initiator concentrations. For normalization, fluorescence signals corresponding to each fluorophore were rescaled to relative fluorescence units (RFU). The initial fluorescence from the negative control (lacking input strands) was defined as the baseline (RFU = 0). The highest fluorescence value observed in each set of parallel kinetic fluorescence experiments, or that obtained from the corresponding standard reporter initiators, was defined as the upper reference (RFU = 1).
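A minimal sketch of this two-step processing, assuming an ordinary least-squares calibration line and a simple min-max rescaling, is shown below. The function names and the example readings are hypothetical and are not taken from the source data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_concentration(fluorescence, slope, intercept):
    """Invert the calibration line to estimate the output strand concentration."""
    return (fluorescence - intercept) / slope


def normalize_rfu(trace, baseline, upper):
    """Rescale a kinetic trace so that baseline maps to 0 and upper maps to 1."""
    span = upper - baseline
    if span == 0:
        raise ValueError("upper reference must differ from baseline")
    return [(f - baseline) / span for f in trace]


if __name__ == "__main__":
    # Hypothetical calibration: concentrations (nM) vs steady-state fluorescence.
    slope, intercept = fit_line([0, 10, 20, 40], [100, 300, 500, 900])
    print(to_concentration(700, slope, intercept))
    print(normalize_rfu([100, 550, 1000], baseline=100, upper=1000))
```

Here `baseline` would be the initial reading of the negative control and `upper` the highest value in the parallel set or the standard reporter-initiator reading, following the definitions in the text.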

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All experimental data generated in this study are provided in the main text, Supplementary Information, and Source data file. DNA sequences used in this study have been deposited in Figshare [https://doi.org/10.6084/m9.figshare.30339310]⁴⁰. The previously published datasets used for Random Forest and Decision Tree model training and validation are available from the NCBI Gene Expression Omnibus under accession codes GSE108124 and GSE218549, respectively. Source data are provided with this paper.

Code availability

The code used to develop the model, perform the analyses and generate results in this study is publicly available and has been deposited in GitHub at https://github.com/Junelann/DNAdecisiontree, under Apache-2.0 license. The specific version of the code associated with this publication is archived in Zenodo and is accessible via https://doi.org/10.5281/zenodo.17332060⁴³.

References

1. Jia, S. et al. DNA-based biocomputing circuits and their biomedical applications. Nat. Rev. Bioeng. 3, 535–548 (2025).
2. Zhang, C. et al. Cancer diagnosis with DNA molecular computation. Nat. Nanotechnol. 15, 709–715 (2020).
3. Liu, J. et al. Unlocking genetic profiles with a programmable DNA-powered decoding circuit. Adv. Sci. 10, e2206343 (2023).
4. Lopez, R., Wang, R. & Seelig, G. A molecular multi-gene classifier for disease diagnostics. Nat. Chem. 10, 746–754 (2018).
5. Yang, S. et al. DNA as a universal chemical substrate for computing and data storage. Nat. Rev. Chem. 8, 179–194 (2024).
6. Cherry, K. M. & Qian, L. Scaling up molecular pattern recognition with DNA-based winner-take-all neural networks. Nature 559, 370–376 (2018).
7. Xiong, X. et al. Molecular convolutional neural networks with DNA regulatory circuits. Nat. Mach. Intell. 4, 625–635 (2022).
8. Su, H. et al. High-efficiency and integrable DNA arithmetic and logic system based on strand displacement synthesis. Nat. Commun. 10, 5390 (2019).
9. Song, T. et al. Fast and compact DNA logic circuits based on single-stranded gates using strand-displacing polymerase. Nat. Nanotechnol. 14, 1075–1081 (2019).
10. Okumura, S. et al. Nonlinear decision-making with enzymatic neural networks. Nature 610, 496–501 (2022).
11. Lv, H. et al. DNA-based programmable gate arrays for general-purpose DNA computing. Nature 622, 292–300 (2023).
12. Qian, L., Winfree, E. & Bruck, J. Neural network computation with DNA strand displacement cascades. Nature 475, 368–372 (2011).
13. Genot, A. J. et al. High-resolution mapping of bifurcations in nonlinear biochemical circuits. Nat. Chem. 8, 760–767 (2016).
14. Hassija, V. et al. Interpreting black-box models: a review on explainable artificial intelligence. Cogn. Comput. 16, 45–74 (2024).
15. Graziani, M. et al. A global taxonomy of interpretable AI: unifying the terminology for the technical and social sciences. Artif. Intell. Rev. 56, 3473–3504 (2023).
16. Kundu, S. AI in medicine must be explainable. Nat. Med. 27, 1328 (2021).
17. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
18. Kingsford, C. & Salzberg, S. L. What are decision trees? Nat. Biotechnol. 26, 1011–1013 (2008).
19. Roy, O. P. & Kavoussi, L. R. Testicular cancer: decision tree model has potential to improve NSGCT management. Nat. Rev. Urol. 7, 308–310 (2010).
20. Wang, W., Ge, Q. & Zhao, X. Enzyme-free isothermal amplification strategy for the detection of tumor-associated biomarkers: a review. TrAC, Trends Anal. Chem. 160, 116960 (2023).
21. Gines, G. et al. Isothermal digital detection of microRNAs using background-free molecular circuit. Sci. Adv. 6, eaay5952 (2020).
22. Seelig, G., Soloveichik, D., Zhang, D. Y. & Winfree, E. Enzyme-free nucleic acid logic circuits. Science 314, 1585–1588 (2006).
23. Qian, L. & Winfree, E. Scaling up digital circuit computation with DNA strand displacement cascades. Science 332, 1196–1201 (2011).
24. Thubagere, A. J. et al. Compiler-aided systematic construction of large-scale DNA strand displacement circuits using unpurified components. Nat. Commun. 8, 14373 (2017).
25. Wang, B. et al. Effective design principles for leakless strand displacement systems. Proc. Natl Acad. Sci. USA 115, E12182–E12191 (2018).
26. Song, T. et al. Improving the performance of DNA strand displacement circuits by shadow cancellation. ACS Nano 12, 11689–11697 (2018).
27. Zhang, D. Y. & Winfree, E. Control of DNA strand displacement kinetics using toehold exchange. J. Am. Chem. Soc. 131, 17303–17314 (2009).
28. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
29. Albu, A. From logical inference to decision trees in medical diagnosis. In The 6th IEEE International Conference on E-Health and Bioengineering (EHB) (2017).
30. American Diabetes Association Professional Practice Committee. 2. Classification and diagnosis of diabetes: standards of medical care in diabetes—2022. Diabetes Care 45, S17–S38 (2022).
31. Breiling, A. & Lyko, F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenet. Chromatin 8, 24 (2015).
32. Gaiser, T. et al. DNA-methylation analysis as a tool for thymoma classification. Cancers 14, 5876 (2022).
33. Liu, J. & Zhang, Q. Accelerating toehold-mediated DNA strand displacement reaction using polyquaternium. Chem. Eur. J. 31, e202404249 (2025).
34. Kim, Y. J., Rho, W.-Y., Park, S.-M. & Jun, B.-H. Optical nanomaterial-based detection of biomarkers in liquid biopsy. J. Hematol. Oncol. 17, 10 (2024).
35. Yang, L. et al. A spatially localized DNA linear classifier for cancer diagnosis. Nat. Commun. 15, 4583 (2024).
36. Singh, A., Patel, G. & Patel, S. S. Twinkle-catalyzed toehold-mediated DNA strand displacement reaction. J. Am. Chem. Soc. 145, 24522–24534 (2023).
37. Banerjee, M., Reynolds, E., Andersson, H. B. & Nallamothu, B. K. Tree-based analysis. Circ. Cardiovasc. Qual. Outcomes 12, e004879 (2019).
38. Mira, J. M. Symbols versus connections: 50 years of artificial intelligence. Neurocomputing 71, 671–680 (2008).
39. Hitzler, P. et al. Neuro-symbolic approaches in artificial intelligence. Natl Sci. Rev. 9, nwac035 (2022).
40. Liu, J. DNA sequence information. Figshare https://doi.org/10.6084/m9.figshare.30339310 (2025).
41. Li, L. C. & Dahiya, R. MethPrimer: designing primers for methylation PCRs. Bioinformatics 18, 1427–1431 (2002).
42. Zadeh, J. N. et al. NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173 (2011).
43. Liu, J. DNAdecisiontree: code release. Zenodo https://doi.org/10.5281/zenodo.17332060 (2025).

Acknowledgements

We thank Guangyong Chen (Zhejiang Lab, Zhejiang University), Hui Lv (Shanghai Jiao Tong University), and Xiaolin Huang (Shanghai Jiao Tong University) for insightful suggestions and discussions about this work. This work was supported by a grant from the National Key Research and Development Program of China (2021YFA0909400) to D.H.; by grants from the National Natural Science Foundation of China (82402734 to J.L.; 22225402 and 32341017 to D.H.); by a grant from the Natural Science Foundation of Shanghai (24ZR1445000) to J.L.; and by a grant from the Shanghai Public Health Special Research Project (2024GKQ35) to J.L.

Author information

Authors and Affiliations

Institute of Molecular Medicine (IMM), Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
Junlan Liu, Yongqi Han, Jinxing Song, Chunhai Fan, Weihong Tan & Da Han
Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, China
Qian Tang, Pei Guo, Weihong Tan & Da Han
School of Chemistry and Chemical Engineering, New Cornerstone Science Laboratory, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, China
Fei Wang & Chunhai Fan

Contributions

J.L. and D.H. conceived the project. J.L. designed and performed the experiments. Q.T. performed data visualization. J.L., Y.H., and J.S. analyzed the fluorescence kinetics data. J.L. performed data visualization and wrote the initial draft. F.W., P.G., C.F., W.T., and D.H. revised the manuscript. D.H. and W.T. supervised the entire study.

Corresponding authors

Correspondence to Pei Guo, Chunhai Fan, Weihong Tan or Da Han.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Nicolas Lobato-Dauzier, Yifan Lyu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Liu, J., Tang, Q., Han, Y. et al. Interpretable molecular decision-making with DNA-based scalable and memory-efficient tree computation. Nat Commun 16, 10311 (2025). https://doi.org/10.1038/s41467-025-66610-1

Received: 26 May 2025
Accepted: 09 November 2025
Published: 21 November 2025
Version of record: 21 November 2025
DOI: https://doi.org/10.1038/s41467-025-66610-1
