CLADES: a hybrid NeuralODE-Gillespie approach for unveiling clonal cell fate and differentiation dynamics

Gao, Mingze; Barile, Melania; Chabra, Shirom; Haltalli, Myriam; Calderbank, Emily F.; Chao, Yiming; Zheng, Weizhong; Wilson, Nicola K.; Laurenti, Elisa; Göttgens, Berthold; Huang, Yuanhua

doi:10.1038/s41467-025-63150-6

Article
Open access
Published: 01 September 2025

Nature Communications volume 16, Article number: 8174 (2025)

Abstract

Recent lineage-tracing-based single-cell techniques (LT-scSeq), e.g., the Lineage And RNA RecoverY (LARRY) barcoding system, have enabled clonally resolved interpretation of differentiation trajectories. However, the heterogeneity of clone-specific kinetics remains understudied, both quantitatively and in terms of interpretability, thus limiting the power of barcoding systems to unravel how heterogeneous stem cell clones drive the overall cell population dynamics. Here, we present CLADES, a NeuralODE-based framework to faithfully estimate the clone- and population-specific kinetics from both newly generated and publicly available LARRY LT-scSeq data. By incorporating a stochastic simulation algorithm (SSA) and differentially expressed gene (DEG) analysis, CLADES yields a summary of cell division dynamics across differentiation time-courses and reconstructs the lineage tree of the progenitor cells in a quantitative way. Moreover, clone-level behaviors can be grouped into characteristic types by pooling individual clones into meta-clones for analyses at various resolutions. Finally, we show that meta-clone-specific cellular behaviors identified by CLADES originate from hematopoietic stem and progenitor cells in distinct transcriptional states. In conclusion, we report a scalable approach to robustly quantify clone-specific differentiation kinetics of cellular populations for time-series systems with static barcoding designs.

Introduction

One of the fundamental challenges in developmental biology is to understand the complicated cellular dynamics in a temporal order^1,2. The interplay between cell proliferation (cellular expansion) and differentiation (phenotypic transition) plays an essential role in various biological processes, e.g., tissue development, regeneration, and the activation of innate immune response mechanisms³.

In recent years, the transcriptome-wide single-cell RNA sequencing (scRNA-seq) technique has been widely used as a scalable approach for studying cellular trajectories, either using a snapshot of transitioning cell populations or via a time-series design. Computational algorithms for dynamical analyses include pseudotime-based trajectory inference^4,5,6,7,8, unspliced RNA-based RNA velocity methods for predicting temporal gene expression changes^9,10,11 and metabolic labeling-based protocols for analyzing temporal dynamics, where nascent RNA is labeled with 4sU^12,13,14.

However, scRNA-seq alone cannot offer fine-grained insights into the clone-level heterogeneity within cell clusters. Therefore, emerging lineage tracing techniques that utilize unique and inheritable DNA barcodes to track individual cells can offer a complementary approach to study cellular dynamics^1,3,15.

Techniques of lineage-tracing coupled with single-cell sequencing (LT-scSeq) include retrospective analyses via endogenous genetic barcodes (e.g., mitochondrial variants and copy number variations^{16,17,18,19,20}) and prospective designs with recently developed barcoding technologies via exogenous barcodes. Prospective designs can be broadly categorized into static barcoding with one-off induction to focus more on the clone-specific differentiation^21,22, and dynamical barcoding with inductions at multiple time points to trace phylogenies^23,24. In this study, we focus primarily on modeling static barcoding LT-scSeq data, where numerous progenitor cells are labeled with a unique barcode at an early point; then the barcode is propagated to all other populations (progeny from here on), hence facilitating the delineation of a high-resolution differentiation topology. One prominent example of this type of technology is the lentivirus-based system^25,26 LARRY²², which has been recently employed, for example, to predict clonal fate bias in hematopoiesis²² and mouse brain formation^27,28, unveiling new regulators/markers involved in reprogramming²⁹, cell differentiation³⁰, and identifying pathways relevant to cancer progression^31,32.

Several computational algorithms have recently been proposed to analyze population dynamics and LT-scSeq experiments. Depending on the space of the cell states, these methods can be grouped into different categories. First, continuous models, such as Pseudodynamics by Fischer et al.³³, model population distribution shifts to quantify developmental potentials for time-series data. Second, a common choice is to mimic a near-continuous space by employing a finite state mapping where each cell is treated as a state. One example is LineageOT³⁴, which maps cells from the same clone and adjacent time points using optimal transport; it was developed to recover lineage couplings in CRISPR-based lineage tracing datasets and outperforms the original OT method that operates without barcodes³⁵. Another example is CoSpar³⁶, which performs topology mapping under the constraints of sparsity and coherence and brought new insights to multi-clonal time series data analysis with respect to the identification of early cell fate bias at the cellular level. Third, a discrete state space over cell types is also commonly used to ensure better interpretability, higher robustness, and computational efficiency³⁷. Finally, other approaches have also significantly contributed to this field using various modalities, for example, by learning smoothed transcription and regulatory dynamics³⁸, analyzing the potency bias during HSC reactivation following platelet depletion³⁹ or utilizing epimutations in DNA methylation for in vivo lineage tracing experiments⁴⁰.

Despite various efforts to unveil more information from LT-scSeq data, several technical challenges persist. In addition to common issues such as the loss of barcodes or small clone size, scRNA-seq is destructive and only captures a fraction of the total cells, so that only the relative changes of cell state abundance³⁶ can be tracked rather than the actual kinetic rates that affect the overall population dynamics. This hampers the exploration of clone-specific dynamical patterns. Reliable kinetic rates, in turn, enable the reconstruction of differentiation topologies and provide a quantitative estimate of division numbers³⁹ as well as the likelihood of producing specific lineage outputs from a progenitor cell, which is useful in various scenarios, e.g., investigating patterns related to differentiation, aging or disease.

Inspired by recent research efforts, and to address the aforementioned issues, we focused on the static barcoding system (LARRY; Fig. 1a) and developed a robust and generalizable algorithm to analyze LT-scSeq datasets, named CLADES (Clonal Lineage Analysis with Differential Equations and Stochastic simulations), which comprises two key components: 1) a model estimator, based on NeuralODE⁴¹, to delineate clone-specific trajectories and state-dependent transition rates; 2) a data generator, via the Gillespie algorithm⁴², to simulate the differentiation topologies between progenitors and progenies. For both components, CLADES uses meta-clone-specific dynamics (see below) to handle many clones with few cell counts. CLADES also uses scaling factors and a Poisson negative log-likelihood loss to efficiently handle the problem of barcode dropouts during the experiment (Methods).

**Fig. 1: Overview of the experimental design of LT-scSeq, CLADES’s architecture and its robustness on synthetic datasets.**

Collectively, CLADES determines clone-specific dynamics and provides a quantitative description of the differentiation topology between a progenitor and its progeny.

Results

ODE function for (meta-) clone-specific dynamics

We model clone dynamics as a system of independent ordinary differential equations (ODEs) to infer the time-specific transition rates between cell states. The dynamics of each clone (or each meta-clone, namely a group of clones with similar dynamical profiles; see below) are described by the same equation but with clone-specific parameters; this aligns with the assumption that only intra-clone transitions are allowed (cells keep the same clone identity throughout the experiment). Therefore, the estimation and inference processes run in parallel for each (meta-) clone. Without loss of generality, we describe the (multivariate) ODE function f for a specific clone c with parameters θ as shown in Eq. (1),

$$\begin{aligned} f &= \boldsymbol{x}'(t) = \boldsymbol{x}(t)\cdot K_{1}(t) + \boldsymbol{x}(t)\odot\left(K_{2}(t) - \sum_{i=0}^{ncols} K_{1}^{T}(t)\right),\\ \boldsymbol{x}(t) &= \mathrm{ODESolver}\left(\boldsymbol{x}(t_{0}), t_{0}, t, f, \theta\right),\quad t\in(t_{0}, t_{1}, \ldots, t_{n}), \end{aligned} \tag{1}$$

where ⋅ is the dot product of two matrices, ⊙ is element-wise multiplication, x(t) is a tensor of total counts for all cell states (used interchangeably with cell populations), with dimensions (time-points, meta-clones, populations), and t refers to the real time of the biological system. K₁(t) models the differentiation among populations and is based on the edges of the PAGA⁴³ graph (denoted as L, Methods) with expert curation; it is non-negative and strictly upper triangular (see Supplementary Fig. S1a,b). K₂(t) (the diagonal of the matrix L) is a one-dimensional vector representing the net proliferation rate for each cell state. Generally, K₁(t) and K₂(t) can either be constant, which gives the classic population growth model with exponential change of population sizes, or functions of the real time t (Fig. 1b), which allows for more flexibility in our model^44,45,46.
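To make the structure of Eq. (1) concrete, the following is a minimal sketch of its right-hand side for a single (meta-)clone with constant rates. It is not the CLADES implementation: the function names `clone_rhs` and `integrate_rk4` are hypothetical, a fixed-step Runge-Kutta scheme stands in for the ODE solver, and the summation term is read here as the total outgoing differentiation rate of each state, which is an assumption about the notation.

```python
def clone_rhs(x, K1, K2):
    """Right-hand side of Eq. (1) for one (meta-)clone with n cell states.

    x  : list of n total cell counts
    K1 : n x n differentiation rates, K1[i][j] = rate from state i to state j
    K2 : list of n net proliferation rates
    """
    n = len(x)
    # x . K1: cells arriving in state j from every upstream state i
    inflow = [sum(x[i] * K1[i][j] for i in range(n)) for j in range(n)]
    # Assumption: the sum in Eq. (1) is the total rate at which state i is left
    outflow_rate = [sum(K1[i]) for i in range(n)]
    return [inflow[j] + x[j] * (K2[j] - outflow_rate[j]) for j in range(n)]


def integrate_rk4(x0, K1, K2, t_end, n_steps=1000):
    """Fixed-step RK4 stand-in for the ODE solver (constant rates only)."""
    h = t_end / n_steps
    x = list(x0)
    for _ in range(n_steps):
        k1 = clone_rhs(x, K1, K2)
        k2 = clone_rhs([a + 0.5 * h * b for a, b in zip(x, k1)], K1, K2)
        k3 = clone_rhs([a + 0.5 * h * b for a, b in zip(x, k2)], K1, K2)
        k4 = clone_rhs([a + h * b for a, b in zip(x, k3)], K1, K2)
        x = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(x, k1, k2, k3, k4)]
    return x


# Illustrative three-state example with invented rates (per day)
K1 = [[0.0, 0.5, 0.2],
      [0.0, 0.0, 0.3],
      [0.0, 0.0, 0.0]]   # strictly upper triangular, non-negative
K2 = [0.8, 0.4, 0.0]     # net proliferation per state
print(integrate_rk4([100.0, 0.0, 0.0], K1, K2, t_end=3.0))
```

Under this reading, differentiation only moves cells between states: with all entries of `K2` set to zero the total count is conserved, and with `K1` set to zero each state grows or shrinks exponentially at its own net rate.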

Relative cluster proportions alone are not enough to estimate the expansion potential of different clones: a cluster whose relative proportion grows over time may still be shrinking in absolute size, only more slowly than the other clusters, and this would bias the estimated rates. We solved this issue by measuring and then factoring in the culture’s expansion. At each time point, we measured the cells in the culture, then split a known fraction and sequenced the rest. We can thus obtain the real size of a clone (total cell counts from here on) by multiplying the relative sizes of the clusters by a time-specific scaling factor that also accounts for the number of cells lost during quality control (QC) in the sequencing protocol (Methods).
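The scaling step can be illustrated with a short sketch. The exact definition of the scaling factor is given in the Methods and is not reproduced here; the version below assumes, purely for illustration, that the factor is the measured culture size divided by the number of sequenced cells that pass QC. The function names and all numbers are hypothetical.

```python
def scaling_factor(cells_in_culture, cells_passing_qc):
    """Hypothetical time-specific factor: culture cells per retained sequenced cell."""
    if cells_passing_qc <= 0:
        raise ValueError("need at least one cell passing QC")
    return cells_in_culture / cells_passing_qc


def total_counts(observed_counts, cells_in_culture, cells_passing_qc):
    """Scale observed per-population counts to estimated total cell counts."""
    s = scaling_factor(cells_in_culture, cells_passing_qc)
    return {population: n * s for population, n in observed_counts.items()}


# Invented numbers: the cluster's share rises from 10% to 15% of sequenced cells,
# yet its estimated absolute size falls because the culture itself has shrunk.
early = total_counts({"cluster_A": 10}, cells_in_culture=1000, cells_passing_qc=100)
late = total_counts({"cluster_A": 15}, cells_in_culture=500, cells_passing_qc=100)
print(early, late)  # {'cluster_A': 100.0} {'cluster_A': 75.0}
```

The example reproduces the situation described above: a cluster that expands in relative proportion can still decrease in total cell count.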

CLADES is designed for time-series lineage tracing experiments and mainly requires two types of input data: 1) the estimated total cell counts x(t) per time point t_i, clone c_j and population p_k; 2) putative transition directions between populations, usually derived from the PAGA graph (Supplementary Note 1) with expert curation, which brings prior knowledge into the model (Fig. 1a, b). Under the common assumption that cellular divisions and differentiations between distinctive states are a stochastic process governed by a set of transition rates, we then interpolate the total cell counts on un-sequenced time points using NeuralODEs with biologically informed constraints (Fig. 1b, Methods). The underlying logic is solving an optimization problem between observations and model predictions with meaningful penalties.
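As an illustration of the optimization target, the sketch below evaluates a Poisson negative log-likelihood between observed and predicted cell counts, the type of loss mentioned in the Introduction. It is a generic textbook form and not the CLADES loss itself: the additional penalties and biological constraints described in the Methods are omitted, and the function name `poisson_nll` is hypothetical.

```python
import math


def poisson_nll(observed, predicted, eps=1e-8):
    """Mean Poisson negative log-likelihood of observed counts given predicted means."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        mu = max(mu, eps)  # guard against log(0) for vanishing predictions
        # -log P(y | mu) = mu - y*log(mu) + log(y!)
        total += mu - y * math.log(mu) + math.lgamma(y + 1.0)
    return total / len(observed)


# Invented counts for three populations at one time point
observed = [120.0, 35.0, 4.0]
print(poisson_nll(observed, [118.0, 37.0, 5.0]))  # close fit, lower loss
print(poisson_nll(observed, [60.0, 80.0, 20.0]))  # poor fit, higher loss
```

For a fixed observation the loss is smallest when the predicted mean equals the observed count, which is what drives the fitted trajectories towards the measured total cell counts.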

CLADES takes the input data and feeds them into a multi-layer perceptron (MLP, Methods) with 2 layers, from which the model outputs the rate transition matrices (Supplementary Note 2) among populations and the predicted cell counts using an ODE solver. The rates can be either time-invariant or time-variant at the user’s discretion. After acquiring the rates, CLADES can reconstruct the dynamic changes of cell counts at both the population and the clone levels, and derive the associated confidence intervals (CI) of the kinetic rates (Fig. 1c) as well, providing a measure of uncertainty. By means of the Gillespie algorithm (Supplementary Note 3, Methods), CLADES can also provide a quantitative summary of division topologies and assess the probability of realizing the different fates of any progenitor cell type (Fig. 1d).

Performance and robustness of CLADES on synthetic datasets

To assess the performance of both the constant and the dynamic modes of CLADES, we used synthetic datasets as described in Supplementary Note 4.

We applied CLADES on a synthetic dataset governed by time-invariant rate ODE functions first, where the rates for each clone serve as the ground truth and cell counts are governed by Eq.(1). To test whether CLADES can recover the correct dynamics and provide guidance for appropriate usage with the LT-scSeq experiments, we conducted 4 independent trials with different sampling intervals as training sets and used the same 5 unobserved time points as testing sets, for both the constant and the dynamic modes of the model (Supplementary Table S1).

Generally, the performance of CLADES (Methods) improved with the number of training time points (Fig. 1e (i, ii)). Since the cell counts are determined by the transition rates, we observed similar trends in the absolute error of transition rates and the correlation of cell counts (Supplementary Fig. S2a). The constant mode also consistently performs better than the dynamic mode; this is reasonable as the synthetic data were generated with time-invariant rates, and it indicates a risk of overfitting for the dynamic mode when the underlying pattern of the data is clean and simple (Fig. 1e). We also assessed the robustness of CLADES on datasets with different synthetic noise levels (Methods, Supplementary Note 4). Using trial index #3 (Supplementary Table S1) as an example, we observed that the constant mode performs fairly well when the noise level is below 10, and that the performance gradually deteriorates at noise levels above 10 (Fig. 1e (iii)).

With the time-variant scenario, we aim to demonstrate the limitations of the constant mode when facing complex data structures and highlight the flexibility and robustness of the dynamic mode. Specifically, similar synthetic settings were adopted (Supplementary Table S2), but this time the data were generated using time-variant rates. For both the constant and the dynamic modes, the performance increased when more training data were given (Fig. 1f (i, ii)). However, the overall performance and additional evaluation metrics of the constant mode are always worse (Supplementary Fig. S2b). Interestingly, we also noticed that, apart from outperforming the constant mode when applied to its paired data (Fig. 1f (i, ii)), the dynamic mode generally yields better results even when it is applied to data generated via the other mode (Fig. 1e (ii) and Fig. 1f (i)), suggesting that the dynamic mode is more robust to ‘unpaired patterns’. The performance of CLADES under different noise levels was assessed as well (trial index #3, Supplementary Table S2). As shown in Fig. 1f (iii), the dynamic mode retains a recovery rate > 80% if the noise level is smaller than 15, and the performance gradually decreases after that.

In conclusion, for datasets with relatively simple patterns, the constant mode is suitable with tolerable errors, whilst for other biological systems that are more sophisticated, the dynamic mode of CLADES is preferred.

Characterizing the (meta-) clone-specific kinetic rates in human cord blood

We applied CLADES to a newly generated in vitro LARRY-based LT-scSeq dataset with three time points (24,885 barcoded out of 68,856 high-quality cells; see details in Supplementary Table S3 and Supplementary Note 5). The DNA barcodes were induced at Day 0 in CD34⁺ hematopoietic stem and progenitor cells (HSPCs), and cells were sampled for sequencing at Days 3, 10 and 17 (Fig. 2a, Fig. 2b right). Upon culture, this pool of HSPCs differentiated into 12 distinct cell populations of progenitor or mature blood cells (Fig. 2b left), across the erythroid, myeloid, dendritic cell (DC) and mast cell (MC) lineages.

**Fig. 2: CLADES accurately reconstructed the clone-specific dynamics on a LARRY LT-scSeq data of human cord blood development system.**

DNA barcodes with few cell counts make the analysis of individual clones infeasible (barcode matrix is shown in Fig. 2c and Supplementary Fig. S3); thus, to further reduce stochasticity and model complexity, clones were clustered into meta-clones (Supplementary Fig. S4) based on the similarity between time and population-specific counts of barcoded cells (Supplementary Fig. S5, Methods). The assumption is that hematopoiesis can be conceptualized as a clonal process, where clones have a finite number of differentiation behaviors; clones within a meta-clone will have similar kinetic rates and differentiation outcomes. As the initial barcoded HSPCs are heterogeneous, including both multi-potent and more lineage-restricted progenitors, we expect our approach to retrieve meta-clones with distinct kinetics of differentiation, being initiated by specific subsets of HSPCs. Indeed, we found 12 meta-clones with distinct outputs to four terminal fates (Mast cell, Late Erythroid, Monocyte and DC) (Fig. 2d, Supplementary Fig. S6 and Table S5). As an example, meta-clone 7 predominantly produces the Monocyte lineage, which originates from the most primitive HSC/MPP1 subpopulation, whilst meta-clone 0 predominantly produces cells of the mast cell and Erythroid lineage and correspondingly is initiated by Megakaryocyte Erythroid and Mast cell Progenitors (MEMP).

Given expert-curated putative transition directions derived from PAGA (Supplementary Fig. S1b), CLADES successfully predicted the total cell counts at the experimental time points (see Supplementary Table S4) and interpolated counts on unknown time points along the entire trajectory using both constant and dynamic modes while also providing associated estimation errors (Fig. 2f, Supplementary Fig. S7). Notably, the dynamic mode proved to have better performance provided that proper constraints are enforced to prevent model over-fitting.

As stated in equation (1), the population balance between cell states is governed by transition rates, which are the per capita output within a unit of time. Of note, the maximum inferred proliferation rate (2.5 per day) is consistent with the cell-cycle not lasting less than 10 h, which is biologically reasonable; differentiation frequency does not exceed 2.5 events per day (Methods, Fig. 2e, upper panel). Moreover, the weighted average behavior of all meta-clones should resemble the overall dynamics of the system (including both the barcoded and non-barcoded cells, background from here on, or BG for short, Fig. 2e, lower panel). CLADES calculates a transition rate matrix among cell states for any meta-clone at any time stamp (Fig. 2g).
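The correspondence between the maximum rate and the cell-cycle length can be checked with a one-line calculation. This assumes that the proliferation rate is read as division events per cell per day, so that the mean waiting time between divisions of one cell is the reciprocal of the rate (as for the exponential waiting times used in stochastic simulation); the symbols r_max and τ̄ are introduced here for illustration only.

$$\bar{\tau} = \frac{1}{r_{\max}} = \frac{1}{2.5\ \mathrm{day}^{-1}} = 0.4\ \mathrm{day} = 9.6\ \mathrm{h} \approx 10\ \mathrm{h}$$

Under that reading, a rate of 2.5 per day corresponds to a mean cycle time of about 9.6 h, i.e. roughly the 10 h bound quoted above.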

As the transition rates are point estimates, CLADES uses bootstrapping (Methods) to estimate the 95% CI of the transition rates (two examples in Fig. 2h and complete results in Supplementary Fig. S8) and statistical tests to assess the significance of dynamical rates among meta-clones (Methods, Supplementary Fig. S9). Moreover, we only consider rates inferred for populations that have at least four cells at the respective time point. Interestingly, distinct meta-clones exhibit differences in rates at specific stages of progenitor/precursor maturation. As an example, here we compare two meta-clones, meta-clones 2 and 7, both originating in HSC/MPP 1, but which produce strikingly different differentiated output at day 17 (see pie charts of Fig. 2i, left panel). The differentiated output of meta-clone 2 consists of approximately 50% DCs and 40% monocytes. In contrast, the differentiated output of meta-clone 7 comprises less than 1% DCs and 85% monocytes. Importantly, CLADES estimates transition rates between distinct subpopulations that are consistent with the differentiation behaviors of meta-clones 2 and 7 (Fig. 2i, right panel), with, for example, a significantly higher transition rate from HSC/MPP 2 to DC progenitors on day 10 and day 17 in meta-clone 2 than in meta-clone 7 and conversely, significantly higher transition rates within the Monocytic compartment for meta-clone 7 than meta-clone 2 on Day 10 and 17 (Fig. 2i, right panel).
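The idea behind a bootstrap confidence interval can be sketched generically as follows. The resampling unit and the refitting procedure used by CLADES are described in the Methods and are not reproduced here; in this hypothetical example the statistic is a simple mean over invented replicate values, whereas in CLADES the statistic would be a fitted transition rate. The function name `bootstrap_ci` is an assumption.

```python
import random


def bootstrap_ci(values, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` evaluated on `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


# Invented replicate estimates of one transition rate (per day)
rates = [0.42, 0.51, 0.38, 0.47, 0.55, 0.40, 0.49, 0.44]
print(bootstrap_ci(rates, mean))  # 95% percentile interval around the mean
```

Two meta-clones whose intervals for the same transition do not overlap would be natural candidates for the significance tests mentioned above.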

Resolving cell division history and fate realization of progenitor cells

After estimating transition rates between cell states using CLADES, we then employed the modified Gillespie algorithm (a stochastic simulation algorithm originally used to depict chemical reaction processes, algorithm 1) to further delineate the behavior of the meta-clones describing our system. A common workflow of a Gillespie simulation trial starts with a single progenitor (e.g., HSC/MPP 1). At each step, the time interval until the next reaction (proliferation, differentiation, or apoptosis) is drawn from an exponential distribution (whose parameter λ depends on the current cell states and rates); then, a reaction is picked to occur based on the currently available cell states and the previously estimated transition rate matrices. The simulation continues and updates cell states until certain stopping criteria are met (Fig. 3a).
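The workflow of a single simulation trial can be sketched as below. This is a plain Gillespie scheme with constant rates, not the modified algorithm used by CLADES. Several simplifications are assumptions made for illustration: a positive net proliferation rate is treated as division and a negative one as cell loss, differentiation moves one cell between states without division, and the stopping criteria are a time limit and a cap on the total cell number. The names `gillespie_trial` and `fraction_producing` are hypothetical.

```python
import random


def gillespie_trial(K1, K2, start, t_max, max_cells=5000, rng=None):
    """Simulate one clone from a single cell in state `start`; return final counts."""
    rng = rng or random.Random()
    n = len(K2)
    counts = [0] * n
    counts[start] = 1
    t = 0.0
    while 0 < sum(counts) < max_cells:
        # Enumerate every reaction that can fire, with its propensity
        reactions = []
        for i in range(n):
            if counts[i] == 0:
                continue
            if K2[i] > 0:    # assumption: positive net rate acts as division
                reactions.append((counts[i] * K2[i], "divide", i, i))
            elif K2[i] < 0:  # assumption: negative net rate acts as cell loss
                reactions.append((-counts[i] * K2[i], "die", i, i))
            for j in range(n):
                if K1[i][j] > 0:
                    reactions.append((counts[i] * K1[i][j], "differentiate", i, j))
        total = sum(r[0] for r in reactions)
        if total == 0:
            break
        # Waiting time to the next reaction is exponential with rate `total`
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Choose one reaction with probability proportional to its propensity
        u, acc = rng.random() * total, 0.0
        for prop, kind, src, dst in reactions:
            acc += prop
            if u < acc:
                break
        if kind == "divide":
            counts[src] += 1
        elif kind == "die":
            counts[src] -= 1
        else:
            counts[src] -= 1
            counts[dst] += 1
    return counts


def fraction_producing(K1, K2, start, target, t_max, n_trials=1000, seed=0):
    """Fraction of trials in which `target` cells are present when the trial ends."""
    rng = random.Random(seed)
    hits = sum(
        gillespie_trial(K1, K2, start, t_max, rng=rng)[target] > 0
        for _ in range(n_trials)
    )
    return hits / n_trials


# Invented two-state example: progenitor (0) divides and differentiates into state 1
print(fraction_producing([[0.0, 0.3], [0.0, 0.0]], [0.6, 0.0], 0, 1, t_max=3.0))
```

Repeating the trial many times from the same progenitor gives an empirical estimate of how often a given progeny state is realized, which is the kind of summary used in the following analysis.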

**Fig. 3: Gillespie algorithm reveals division summary and lineage realization of progenitor populations.**

We ran Gillespie simulations on all meta-clones and, for each meta-clone, we inferred several properties, including: 1) the number of cell division events that occurred between the original HSC/MPP 1 (or other progenitor cell of interest) and the first cell produced for each progeny; 2) the number of trials where a certain population was produced out of all simulation trials (1000 in this case). Our simulations, therefore, yielded a quantitative summary of differentiation topology and terminal state realization for each meta-clone. After combining all the simulation results, we could compare the differences in fold change among all meta-clones or with respect to the background cells (Supplementary Fig. S10a for different initial conditions and S10b for the comprehensive comparison between meta-clones). Comparing meta-clones against each other, we could identify correlations between division events of a particular lineage and the potency of a meta-clone. For example, for meta-clone 2, which produces large numbers of DCs, the Gillespie simulation estimated approximately five divisions from HSC/MPP 1 to the production of the first DC. In contrast, for meta-clone 7, which does not produce DCs effectively, more than 12 divisions are predicted (Fig. 3b). For the myeloid lineage, for instance, it takes meta-clone 0 around five division events on average to produce the first Mast cell and ten division events to produce the first early erythroid cell, which is similar to the background (Fig. 3c). This indicates that meta-clone 0 behaves similarly to others in the myeloid lineage.

Next, we scored the capability of a meta-clone to produce progeny. To this aim, we needed to define when to consider that a cell type has been produced. We opted for a threshold of at least 40 cells existing in the compartment at any time point. Meta-clones have distinct differentiation behaviors (Supplementary Fig. S6). Some of them are multi-potent; for instance, meta-clones 2 and 7 have 4 terminal states, while meta-clones 0 and 6 have 3. Others are either bi-potent (meta-clone 5) or uni-potent, being committed to a specific lineage such as Monocytes (meta-clone 4), see Fig. 3d. We also noticed that meta-clones with a similar fate realization, but different total regenerative outputs, can also bear differential kinetic rates (Fig. 3b, Supplementary Table S5). Generally, we saw that lineage output is highly meta-clone specific, and that the probability of producing each progeny (including both intermediate and terminal states) given a progenitor can be inferred probabilistically using Gillespie simulations. Indeed, the number of simulation trials in which a specific progeny is produced directly correlates with the variety in lineage realization (Fig. 3d).
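The scoring rule can be written down in a few lines. The sketch below applies the 40-cell threshold to the peak count of each terminal state over all time points and then labels the meta-clone by the number of realized terminal states. The mapping of three or more states to "multi-potent" is inferred from the examples in the text, and the function names and counts are hypothetical.

```python
def realized_states(peak_counts, terminal_states, min_cells=40):
    """Terminal states whose peak count over all time points reaches the threshold."""
    return [s for s in terminal_states if peak_counts.get(s, 0) >= min_cells]


def potency_label(n_terminal):
    """Label a meta-clone by its number of realized terminal states."""
    if n_terminal >= 3:  # assumption: three or more terminal states is multi-potent
        return "multi-potent"
    if n_terminal == 2:
        return "bi-potent"
    if n_terminal == 1:
        return "uni-potent"
    return "no terminal output"


TERMINAL = ["Mast cell", "Late Erythroid", "Monocyte", "DC"]

# Invented peak counts for one meta-clone
peaks = {"Monocyte": 120, "DC": 45, "Mast cell": 12, "Late Erythroid": 0}
produced = realized_states(peaks, TERMINAL)
print(produced, potency_label(len(produced)))  # ['Monocyte', 'DC'] bi-potent
```

The threshold is a parameter of the sketch, so the sensitivity of the potency labels to the chosen cut-off can be checked by varying `min_cells`.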

CLADES recapitulates the cellular dynamics of murine hematopoiesis

We applied CLADES to a publicly available mouse hematopoietic dataset that was introduced by Weinreb et al.²². This dataset contains around 130,000 sequenced cells (Supplementary Table S6), allowing us to explore the gene signatures corresponding to differential outputs. Compared with the dataset described in the previous section (4 terminal fates out of 12 cell states), there are more potential terminal fates (10 out of 22 cell states) in this dataset. While the number of time points is the same (3 sequenced time points, days 2, 4 and 6, Fig. 4a), cells are followed over only 6 days, and therefore the extent of differentiation is lower than that of the cord blood dataset.

**Fig. 4: CLADES captures the key dynamics of mouse hematopoietic system.**

This dataset has 5859 unique clones in total, some of which are multi-potent whilst others are uni-potent. Of note, only 1989 clones (also with a limited number of cells) appeared in the terminal states. This result indicates the low capture rate of barcodes and justifies the necessity to merge individual clones into meta-clones. We then followed the same preprocessing pipeline as for the cord blood (CB) dataset and constructed 13 meta-clones, again using the time- and state-dependent number of cells in each clone as features (Fig. 4b for combined UMAP, Supplementary Fig. S11 for used features and S12 for UMAP of each meta-clone). Distinct behaviors can be seen in different meta-clones (Supplementary Table S8). The number of multi-potent meta-clones is higher than in the CB dataset, with several giving rise to seven or more terminal states (meta-clones 1, 2, 3, 4). Of interest, two meta-clones, 5 and 8, produce very few cells from any terminal state, with the progeny of early time-points retained within the most primitive progenitor space (Supplementary Fig. S12). As this dataset does not contain flow cytometry-based counting, we used an estimated fold expansion of hematopoietic stem and progenitor cells in culture as an alternative to the total cell counts and scaling factors (Methods). We applied both constant and dynamic modes of CLADES using the PAGA graph with expert curation as the guided transition directions (Fig. 4c). Interestingly, we found that both modes performed similarly, suggesting that, in this dataset, the kinetics of the in vitro system do not change much during the time span of the experiment (Supplementary Table S7).

The distribution of estimated proliferation as well as differentiation rates provided by CLADES overall falls within a reasonable range (Fig. 4d, left panel), with the weighted average of meta-clones across all evaluated time points resembling the rates of the whole system with a Pearson correlation score of 0.819 (Fig. 4d, right panel). Using our method, the likelihood of each meta-clone to produce differentiated progeny can be calculated, thereby simulating which progenitor would be potentially responsible for the lineage output per meta-clone. For example, if this simulation is begun from prog_2, the expected potency can be seen for multiple meta-clones, including meta-clones 1, 2, 3 and 4 (Fig. 4e (i)). Of note, this is not the case for meta-clones 0 and 7, yet if these simulations are begun from prog_3, the expected lineage realization is seen (Fig. 4e (ii)), indicating that prog_3 is the main contributor to the lineage output of these clones. This further highlights the utility of the model to resolve the contribution of each progenitor to the global differentiated output.
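The consistency check between meta-clone rates and whole-system rates can be sketched as follows. The weighting scheme is not specified in this section; the sketch assumes, for illustration, that each meta-clone is weighted by its cell number. The function names and all numbers are hypothetical, and the reported correlation of 0.819 is not reproduced by this toy data.

```python
import math


def weighted_average_rates(rates_by_clone, weights):
    """Weighted average of per-meta-clone rate vectors (same ordering of rates)."""
    total_weight = sum(weights[c] for c in rates_by_clone)
    n_rates = len(next(iter(rates_by_clone.values())))
    return [
        sum(weights[c] * rates_by_clone[c][k] for c in rates_by_clone) / total_weight
        for k in range(n_rates)
    ]


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented rates (per day) for three transitions in two meta-clones
rates = {"meta_0": [0.9, 0.2, 0.1], "meta_1": [0.5, 0.6, 0.3]}
sizes = {"meta_0": 300, "meta_1": 100}  # assumption: weights are cell numbers
background = [0.75, 0.35, 0.12]         # invented whole-system rates

pooled = weighted_average_rates(rates, sizes)
print(pooled, pearson(pooled, background))
```

A correlation close to 1 would indicate that the pooled meta-clone kinetics reproduce the behavior of the whole population.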

Clonal kinetics and characteristics can be inferred from early coordinated gene signatures

We showed earlier that the difference in outputs in terms of time-scales and fate realization of early progenitors unveils differences in transition rates. We thus sought to connect such differences to the possible heterogeneity in the transcriptomic signatures, arguing that the latter may coordinate the transition rates towards each lineage and terminal fates.

Initial comparisons of differentially expressed genes (DEGs) within the earliest progenitor population recapitulated the lineage priming reported by Weinreb et al.²². For instance, meta-clone 0 expresses genes associated with monocytes (Rbms1 and Sirpa) whereas meta-clone 4 expresses Podxl, Pbx1 and Lgals9, which are associated with the megakaryocytic lineage (Supplementary Fig. S14).

The 13 meta-clones identified in this dataset have distinct behaviors (Fig. 5a, Supplementary Fig. S14 and S15). Even meta-clones with similar differentiation outputs, for example meta-clones 0 and 1, which both give rise to late progenitors in the monocyte and neutrophil lineages, differ in the final cell output of the terminal states at day 6 (predominance of monocyte progenitors in meta-clone 0 and of neutrophil progenitors in meta-clone 1; Fig. 5b). To further investigate the gene expression programs at play within the progenitor populations of these meta-clones, we analyzed the molecular signatures in the prog_1 population from the mouse hematopoietic dataset, as prog_1 is the most primitive progenitor population represented in this dataset (Fig. 4a, c). We calculated DEGs between meta-clone 1 and 0 within this population (Fig. 5c). We found meta-clone-specific genes that are characteristic of specific hematopoietic lineages and suggest lineage priming even in the most primitive progenitors (Fig. 5d). For example, up-regulation of genes important in neutrophils (Cd48, Chek1, Cited2, Pcna and Thy1) can be seen in meta-clone 1 whilst genes specifically related to monocytes can be seen in meta-clone 0 (Ccl24, Ccr2 and Zfp36) (Fig.5e). Whilst it was previously demonstrated that the potential potency of these meta-clones is generated from different progenitor populations, the subsequent progenitor population (prog_2) exhibits further lineage specific gene signatures (Fig. 5f–h), including lineage-specific transcription factors such as Gfi1 and Cebpe for neutrophilic lineages (meta-clone 1) and cell surface markers associated with monocytes (Cd33 and Cd52, meta-clone 0).

**Fig. 5: The lineage realization and kinetics of each meta-clone is determined by its early transcriptomic signatures.**

As mentioned previously, both meta-clones 5 and 8 are retained within the primitive progenitor space during the 6-day incubation period with extremely low differentiation rates to terminal states, with most of the daughter cells remaining in prog_1 and prog_2 cell state and a few in prog_4 (see Fig. 5a). To further analyze this unique behavior, we computed the DEGs between meta-clone 5 and all other meta-clones within prog_1 cell state (Fig. 5i). This showed significantly upregulated expression of genes associated with proliferation and self-renewal (Tgfbr3, Evi5, S1pr1, Selp and Yes1) whilst genes associated with specific hematopoietic lineages were downregulated (Gzmb, Elane, Mpo, Irf1). Interestingly, differential gene expression of meta-clone 8 and all other meta-clones within prog_1 population highlighted high expression of markers associated with HSCs (Fgd5) and HSPC entry into cell cycle (Ccne1), and downregulation of genes associated with lineage specification (Gfi1, Ccl3 and Cxcl10), potentially explaining the relative lack of differentiation observed in these meta-clones.

Nevertheless, we saw with our analysis that meta-clones with similar fate realizations can produce progeny in different proportions. For example, meta-clones 2 & 7 of the human cord blood data are both multi-potent, but have different offspring sizes (Supplementary Table S8). Meta-clones 0~4 of the mouse hematopoietic dataset can generate most of the terminal states, but the contribution to each population differs, as shown in Fig. 5a and Supplementary Table S8. This analysis suggests that we can further identify sub-states of progenitor cells based on the magnitude and the rates at which their progeny are produced.

Similarly, the DEGs of meta-clones 1~4 compared to meta-clone 0 (Supplementary Fig. S16 & S17) illustrate that, for multi-potent clones, the transcriptomic heterogeneity at early stages can possibly be linked to differences in fate output and population size of the offspring.

In conclusion, meta-clones analysis allows for a fine-grained definition of specific differentiation behaviors, including both lineage preference as well as size of the generated progenies. In addition, transcriptomic signatures can be assigned to these behaviors, providing a molecular basis for such behaviors.

Discussion

In this paper, we presented CLADES, a NeuralODE-based method to estimate both the proliferation and differentiation rates along the time course data. With the Gillespie algorithm, CLADES delineates detailed division topology and quantitatively summarizes the lineage output of progenitor cells. Importantly, the cellular behaviors identified by CLADES through the meta-clone approach can be associated with specific transcriptional states of the cells initiating these behaviors at the earliest time point.

Many mathematical models for hematopoiesis have assumed that adult physiological hematopoiesis is in perfectly homeostatic conditions (referred to as “steady state”), where kinetic rates are constant values. On the other hand, more recent models^37,47 have begun incorporating the idea that a steady state does not necessarily exist, given that the relative abundance of the different populations, including HSCs, varies with aging and with context. CLADES can partially address this problem thanks to the constant vs dynamic mode option. Indeed, if the constant mode outperforms the dynamic one, we can conclude that the kinetic rates are nearly constant, which may lead to stationary growth or a steady state, whilst if the dynamic mode performs better, then the rates are time-dependent and it is more likely that the system is not in a steady state.

Though CLADES has offered new perspectives to explore the LT-scSeq data, there are still a few challenges and limitations that exist. Firstly, CLADES requires a multi-shot lineage tracing design, limiting its use to experimental designs with at least 2, ideally 3, sequencing time points.

Secondly, CLADES is formulated based on the static barcoding with viral integration, and for now, it cannot be applied to analyze data from cumulative barcoding techniques (like CRISPR-Cas9 DNA editing^24,48, which allows multi-time barcoding, and hence provides a fine-grained structure of sub-clones) or in vivo experiments⁴⁰. However, for retrospective barcoding (often with endogenous genetic variants, for instance, MAESTER⁴⁹), we expect that CLADES can be applied with minimal extension.

Thirdly, although CLADES aims to quantitatively summarize the differentiation kinetics and lineage output of each clone as the primary goal, while examining the regulation and determination from the cellular transcriptome in a separate step, the current model only uses cells with barcodes and, therefore, lacks analysis of cells without a barcode. Future work includes combining both barcodes and transcription data to infer the kinetic rates and lineage output, and mapping cells without barcoding information to meta-clones, since limited barcoding efficiency can affect the performance of the algorithm and DEG analysis as well.

Finally, although we focused on the hematopoietic system in both humans and mice to evaluate our model, CLADES is, in principle, broadly applicable to analyze other developmental systems. Moreover, it may be further applied to study cancer progression, where different clones may have distinct phenotypic properties, e.g., cancer plasticity⁵⁰.

Methods

Ethical Statement

Human cord blood biological samples were sourced ethically, and the research was conducted in accordance with the terms of the informed consents under an institutional review board/research ethics committee-approved protocol as specified below. One umbilical cord blood (CB) sample from one male newborn (as assigned at birth) was obtained with informed consent from a healthy donor by the Cambridge Blood and Stem Cell Biobank (CBSB) in accordance with regulated procedures approved by the relevant Research and Ethics Committees (18/EE/0199 and 24/EE/0116 Research Studies). No participant compensation was provided.

Parameter definitions and data structures

Denote the number of time points available as T, the number of meta-clones as C and the number of populations as P. We define the following terms for each individual meta-clone c,

${x}_{t,p}\in {{\mathbb{R}}}^{T * P}$: original real number of cells in the dish for population p and time t;
${y}_{t,p}\in {{\mathbb{R}}}^{T * P}$: number of cells sequenced from the dish for population p and time t;
${K}_{1}\in {{\mathbb{R}}}^{P * P}$: transition matrix of cell differentiation rates between populations. Its values are constrained to be non-negative and strictly upper triangular (to avoid the reversed differentiation process);
${K}_{2}\in {{\mathbb{R}}}^{1 * P}$: rates of the overall effects of proliferation and apoptosis processes combined within a population, diagonal of the topology graph L. These rates can be negative or positive, depending on whether apoptosis exceeds proliferation or not;
$L\in {\{0,1\}}^{P * P}$: topology graph of cell states, binary version of K₁ + K₂, derived from PAGA with expert curation; after creating the graph edges, we inspected the biological plausibility of each edge and removed transitions that can be confidently ruled out based on previous knowledge;
${P}_{apop}\in {\{0,1\}}^{1 * P}$: vector of fully differentiated populations (terminal fates) with limited proliferation ability;
${P}_{prol}\in {\{0,1\}}^{1 * P}$: vector of progenitor populations (e.g., HSCs) with strong proliferation ability;
${\mu }_{t}\in {{\mathbb{R}}}^{+}$: scaling factor between ${x}_{t,p}$ and ${y}_{t,p}$ for time t;
${t}_{cut}$: stabilization term used in the modified Gillespie algorithm, a trade-off between simulation accuracy and time complexity, default is 1e-4;
l_i: penalty terms used to regularize parameters K_1,2, where $i\in {\mathbb{Z}}$;
λ ∈ (0, 1]: adjustable parameter controlling the magnitude of each penalty term in the loss function.

Annotation of the human cord blood dataset

To annotate the cord blood dataset, we first transferred the labels from the fetal liver atlas published by Popescu et al.⁵¹ by means of the Seurat label transfer algorithm (functions FindTransferAnchors and TransferData). We then clustered our landscape by means of the Leiden algorithm in the scanpy package and assigned to each cluster the cell type of the most commonly transferred label. Finally, for the two clusters labeled as HSC/MPP from the label transfer, we manually inspected the expression of genes known to be highly expressed in the most immature HSC/MPPs and also compared genes differentially expressed between these 2 clusters. From this inspection, the cluster with the most immature features was labeled HSC/MPP1, while the one with differentiative features was labeled HSC/MPP2.

NeuralODE-based architecture

Given a population balance model, the per capita growth/transition rates (Eq. (2)) can be treated as either time-invariant (constant value) or time-variant (they assume a different value at each time point). For a time-invariant scenario, the K_1,2 themselves are the trainable parameters, whilst for the time-variant scenario, the ODE block is built upon a 2-layer multi-layer perceptron (MLP, the number of hidden dimensions is dependent on the number of populations, default is 32) with x_t,p as input and K_1,2 as output. The softplus activation function was used since it is smooth and has a well-defined gradient everywhere, which is theoretically better than non-smooth non-linear activation functions such as ReLU and LeakyReLU, given the inner characteristics of NeuralODE⁴¹. K₁ is further masked by the topology graph L to exclude empirically infeasible transitions (e.g., backward transitions or transitions from Late Erythroids to Monocytes). Squaring ensures that the inferred rates in K₁ are non-negative, and the overall transition matrix π(t) is obtained by combining the estimated rates; the diag function transforms a vector into a matrix that is zero everywhere except on the diagonal, which holds that vector. In summary:

$${K}_{1}(t) =\left\{\begin{array}{ll}{w}_{1}^{2} * L\hfill \quad &{{{\rm{constant}}}}\,{{{\rm{mode}}}}\\ {[ML{P}_{1}(x(t,p))]}^{2} * L\quad &{{{\rm{dynamic}}}}\,{{{\rm{mode}}}}\end{array}\right.\\ {K}_{2}(t) =\left\{\begin{array}{ll}{w}_{2}\hfill \quad &{{{\rm{constant}}}}\,{{{\rm{mode}}}}\\ ML{P}_{2}(x(t,p))\quad &{{{\rm{dynamic}}}}\,{{{\rm{mode}}}}\end{array}\right.\\ \pi (t) ={K}_{1}(t)+diag({K}_{2}(t))$$

(2)

where the parameters w₁, w₂ are the weight matrix and the weight vector, respectively. MLP with 2 dense layers and a relatively small hidden dimension was used because the model is run at the meta-clone rather than at the individual clone scale, which significantly reduces the number of rates to be inferred, and more hidden layers (or larger layer dimensions) would inevitably lead to the risk of over-fitting.
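As a rough illustration of Eq. (2), the sketch below assembles π(t) in both modes using plain Python lists. It is not the CLADES implementation: the helper names (`softplus`, `dense`, `mlp`, `build_pi`, `dynamic_pi`), the parameter layout, the choice to ignore the diagonal of L when masking K₁, and the toy values are all assumptions made for this example.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dense(vec, weights, bias):
    # One dense layer; `weights` holds one row per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def mlp(x, layer1, layer2):
    # 2-layer MLP: dense -> softplus -> dense (hypothetical layout).
    hidden = [softplus(h) for h in dense(x, *layer1)]
    return dense(hidden, *layer2)

def build_pi(k1_raw, k2, L):
    """pi = (k1_raw ** 2 * L) + diag(k2), following Eq. (2).

    Assumption: only off-diagonal entries of L mask K1; the diagonal
    of pi is filled by the net growth rates K2.
    """
    P = len(L)
    pi = [[(k1_raw[i][j] ** 2) * L[i][j] if i != j else 0.0
           for j in range(P)] for i in range(P)]
    for p in range(P):
        pi[p][p] = k2[p]
    return pi

def dynamic_pi(x, mlp1, mlp2, L):
    # Dynamic mode: MLP_1 emits P*P raw rates, MLP_2 emits P net rates.
    P = len(L)
    flat = mlp(x, *mlp1)
    k1_raw = [flat[i * P:(i + 1) * P] for i in range(P)]
    return build_pi(k1_raw, mlp(x, *mlp2), L)

# Constant mode with two populations (toy values).
L = [[1, 1], [0, 1]]
pi_const = build_pi([[0.0, -2.0], [3.0, 0.0]], [0.5, -0.1], L)
```

In this toy case the backward entry (population 2 to population 1) is zeroed by the mask even though its raw weight is non-zero, and the negative raw weight on the forward edge becomes a positive rate after squaring.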

Scaling factor for experimental cell counts

As scRNA-seq is destructive by nature, cells sequenced at later time points cannot quantitatively reflect the accurate dynamics of cell counts during this period. Since our model is based on the real number of cells at each time point, sequenced cell counts were scaled back to total cell counts in the culture environment based on additional information (e.g., either manual counts or data from fluorescence-activated cell sorting, FACS⁵²), before being fed to the model.

In order to calculate the scaling factor between sequenced counts and real total counts of the human cord blood data, the number of cells in the dish was measured at each sequencing time (Supplementary Table S3, Supplementary Table S5). Then the estimated total number of cells at each time point is computed in a chained cascading way:

$${x}_{1} ={y}_{1}\cdot \frac{FA{C}_{1}}{{c}_{1}}\\ {x}_{2} ={y}_{2}\cdot \frac{FA{C}_{2}}{{c}_{2}}\cdot \frac{FA{C}_{1}}{FA{C}_{1}-{y}_{1}}\\ {x}_{3} ={y}_{3}\cdot \frac{FA{C}_{3}}{{c}_{3}}\cdot \frac{FA{C}_{2}}{FA{C}_{2}-{y}_{2}}\cdot \frac{FA{C}_{1}}{FA{C}_{1}-{y}_{1}}$$

(3)

where FAC_1,2,3, y_1,2,3, and c_1,2,3 are the numbers of cells sorted in the dish, sequenced in the experiment, and with clonal information at different time points, respectively, and x_1,2,3 is a cell count tensor with a shape of (time, meta-clone, population). Of note, for downstream analysis, we only consider rates inferred for populations that have at least 4 cells (before scaling) at the respective time point.
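The chained structure of Eq. (3) can be sketched as a running correction factor that accumulates the fraction of the culture removed at each earlier sequencing step. This is a minimal sketch with scalar totals per time point; the function name `scale_counts` and the numbers are hypothetical and are not taken from the Supplementary Tables.

```python
def scale_counts(y, fac, c):
    """Scale sequenced counts back to estimated culture counts, as in Eq. (3).

    y[t]   : cells sequenced at time point t
    fac[t] : cells sorted in the dish at time point t
    c[t]   : cells with clonal information at time point t
    """
    scaled = []
    carry = 1.0  # product of FAC_k / (FAC_k - y_k) over earlier time points
    for y_t, fac_t, c_t in zip(y, fac, c):
        scaled.append(y_t * fac_t / c_t * carry)
        carry *= fac_t / (fac_t - y_t)
    return scaled

# Hypothetical values for three sequencing time points.
x = scale_counts(y=[200, 500, 300], fac=[1000, 4000, 9000], c=[100, 250, 150])
```

Each later time point inherits the corrections of all earlier ones, which is why removing a large fraction of the dish for sequencing early on inflates every subsequent scaling factor.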

For the cord blood dataset, we introduced some restrictions based on biological knowledge, by forcing: 1) all Ery populations (Early Ery, Mid Ery, Late Ery) to have zero counts at day 0 and day 3, 2) all DC populations (DC and DC precursors) to have zero counts at day 0 and day 3, 3) all HSC/MPP1 populations to have zero counts at day 17.

The scaling factors for the mouse hematopoiesis data are calculated similarly, except that we used the fold change of cell counts provided by the authors, as the exact number of cells cultured is not accessible.

Initial conditions of the NeuralODE framework

Solving an ODE system is essentially an initial value problem (usually the initial value is the first available time point). Static barcoding techniques like LARRY offer a retrospective ground truth, that if a barcode is seen at a later time point, one cell must have been barcoded with it at the induction time point. The LARRY system produces a vast variability of combinations of barcodes, making the possibility of having two cells with the same barcode extremely low. Therefore, this experimental protocol of LARRY allows inferring the initial condition (unobserved data at Day 0), which can be used by CLADES.

Specifically, apart from the sequenced data, we manually added an extra time point, Day 0, to the dataset, in which the number of initially labeled cells equals the number of unique barcodes, e.g., for each meta-clone. However, we do not know which populations were initially barcoded. Assuming that the distribution of barcoded cells does not change much between Day 0 and Day 3, we took the Day 3 distribution and scaled it to the expected number of initially labeled cells. Note that, in doing so, we only relied on the clones for which we have cells at all time points. Consequently, some meta-clones were removed from our analysis, since none of their clones satisfied our requirement.
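A minimal sketch of this Day 0 construction for a single meta-clone, assuming the Day 3 counts are available as a dictionary keyed by population; the function name `infer_day0` and the example numbers are hypothetical.

```python
def infer_day0(day3_counts, n_barcodes):
    """Rescale the Day 3 population distribution to the number of unique barcodes.

    day3_counts : dict mapping population name -> cell count at Day 3
    n_barcodes  : number of unique barcodes in the meta-clone, taken as the
                  number of initially labeled cells
    """
    total = sum(day3_counts.values())
    if total == 0:
        raise ValueError("no Day 3 cells available for this meta-clone")
    return {pop: n_barcodes * n / total for pop, n in day3_counts.items()}

# 20 unique barcodes distributed like the Day 3 composition (toy values).
day0 = infer_day0({"prog_1": 30, "prog_2": 10}, n_barcodes=20)
```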

As a model whose number of parameters is larger than the available data points tends to have a large solution space, an extra data point should make it more constrained (less flexible), and have a smaller estimation error after the optimization process.

Parameter inference

Given a LARRY-based LT-scSeq dataset with noise due to detection or loss of barcodes, we formulate the cell counts at each time point as sampled from a Poisson distribution with means given by the neural network (we found it to be more robust than the commonly used GaussianNLL loss). We minimize the negative log-likelihood loss for each meta-clone separately as follows,

$${y}_{t,p} =\frac{{x}_{t,p}}{{\mu }_{t}}\qquad {\hat{y}}_{t,p}=\frac{{\hat{x}}_{t,p}}{{\mu }_{t}}\\ recon =PoissonNLL({\hat{y}}_{t,p},\,{y}_{t,p})\\ ={\hat{y}}_{t,p}-{y}_{t,p}*log({\hat{y}}_{t,p})+log({y}_{t,p}!)$$

(4)
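The reconstruction term in Eq. (4) can be written directly for a single entry. The sketch below uses `math.lgamma(y + 1)` for log(y!); the function names are placeholders, and the assumption that the predicted mean is strictly positive is made explicit.

```python
import math

def poisson_nll(y_hat, y):
    """Negative log-likelihood of observing count y under Poisson(y_hat)."""
    if y_hat <= 0:
        raise ValueError("predicted mean must be positive")
    return y_hat - y * math.log(y_hat) + math.lgamma(y + 1)

def recon_loss(y_hat_values, y_values):
    # Sum of the per-entry terms over populations (and time points).
    return sum(poisson_nll(p, o) for p, o in zip(y_hat_values, y_values))
```

For a fixed observation y > 0, the term is smallest when the predicted mean equals y, which is what drives the fit towards the sequenced counts.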

Whilst the inputs and outputs of the NeuralODE algorithm are based on real cell counts x_t,p to interpolate and mimic the natural development process, our reconstruction loss is based on the original sequenced cell counts y_t,p to avoid making a stiff ODE that is difficult to solve and to speed up the back-propagation process⁵³, as the cell counts can easily scale to millions. Besides the reconstruction loss, the model also incorporates the penalty terms shown in Eq. (5), where the default values are λ₀, λ₅ = 1.0, λ₁, λ₂, λ₃ = 0.5, and λ₄ = 0.1,

$$l0 ={\lambda }_{0} * \left({\left\Vert {K}_{1}(t)\ > \ 6\right\Vert }_{1}+{\left\Vert {K}_{2}(t)\ > \ 6\right\Vert }_{1}\right)\\ l1 ={\lambda }_{1} * {\left\Vert {K}_{2}(t) * {P}_{apop}\ > \ 0\right\Vert }_{1}\\ l2 ={\lambda }_{2} * \left\{\begin{array}{l}{\left\Vert {K}_{1,2}(t) * ({\sum }_{t\in T}{x}_{t,p}=\!=0)\right\Vert }_{2} \\ {\left\Vert {K}_{1,2}(t) * ({x}_{t,p}=\!=0)\right\Vert }_{2} \hfill\end{array}\right.\\ l3 ={\lambda }_{3} * {\left\Vert {\sum }_{c=1}^{C}{K}_{1,2}{(t)}_{c} * {w}_{c}-{K}_{1,2}{(t)}_{bg}\right\Vert }_{1}\\ l4 ={\lambda }_{4} * {\left\Vert {\sum }_{c=1}^{C}{K}_{1,2}{(t)}_{c} < 0\right\Vert }_{2}\\ l5 ={\lambda }_{5} * {\left\Vert {K}_{2}(t) * {P}_{prol} < 0\right\Vert }_{1}\\ s.t.\,{w}_{c} =\frac{{\sum }_{t}^{T}{\sum }_{p}^{P}{x}_{t,c,p}}{\mathop{\sum }_{t}^{T}\mathop{\sum }_{c}^{C}\mathop{\sum }_{p}^{P}{x}_{t,c,p}}$$

(5)

Specifically, the rationale behind these penalties is:

l0: the transition rates in topology L should not be so large that they violate prior biological knowledge. We used an upper bound of 6 per day here because a cell cycle cannot be arbitrarily short (6 per day corresponds to an average duration of 4 h). However, this is an adjustable parameter at the user’s discretion.
l1: fully differentiated populations should have limited proliferation ability (e.g., Late Erythroids).
l2: theoretically, populations with 0 cell counts at a certain time point should not have the ability to either proliferate or differentiate; however, to account for the possibility of barcode dropouts or sequencing errors that mistakenly assign a 0 cell count to a population, a softer penalty is applied (in addition to the main recon loss).
l3: the weighted mean of the estimated rates for each meta-clone should mimic the dynamics of background cells. Here, “background” stands for all available cells, independent of the barcode presence/quality.
l4: we hypothesize that the apoptosis process should not be too quick in a homeostatic environment.
l5: early progenitor populations should have an overall positive net growth rate (e.g., HSCs).

We used the L1 norm for most penalties due to its ability to select non-zero parameters, except for l2 and l4, where the L2 norm was applied to make the penalty less stringent. For l2, in an ideal scenario where cell counts sequenced at each population and time point are 100% accurate, a population with 0 counts should have neither proliferation nor differentiation ability. However, in reality, loss of barcodes or sequencing error at each time point could introduce extra dropouts to cell counts, especially for small-sized clones, making the data harder to analyze (e.g., in the cord blood dataset, meta-clone 1 does not have any HSC/MPP1 or HSC/MPP2 cells, whilst most of the progenies exist). Using the aforementioned technique, CLADES has the ability to counteract this negative effect and automatically interpolate cell counts to recover a smoothed trajectory. For l4, our intention was to introduce only moderate constraints on negative transition rates, which preserves the model’s flexibility when facing complicated systems.
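To make three of these penalties concrete, the sketch below implements l0, l1 and l5 for a single meta-clone and time point. The notation ‖K > 6‖₁ in Eq. (5) leaves some room for interpretation; here it is read as the summed amount by which rates exceed the bound (a hinge-style penalty), which is an assumption of this example rather than a statement about the CLADES code. Function names are hypothetical, and the default λ values follow those stated in the text.

```python
def penalty_l0(k1, k2, lam=1.0, max_rate=6.0):
    # Summed excess of any differentiation (k1) or net growth (k2) rate above max_rate.
    rates = [r for row in k1 for r in row] + list(k2)
    return lam * sum(max(r - max_rate, 0.0) for r in rates)

def penalty_l1(k2, p_apop, lam=0.5):
    # Terminal populations (p_apop == 1) are penalized for positive net growth.
    return lam * sum(max(r, 0.0) for r, flag in zip(k2, p_apop) if flag)

def penalty_l5(k2, p_prol, lam=1.0):
    # Progenitor populations (p_prol == 1) are penalized for negative net growth.
    return lam * sum(max(-r, 0.0) for r, flag in zip(k2, p_prol) if flag)
```

Under this reading, rates inside the allowed range contribute nothing, so the penalties only become active when a biological prior is violated.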

The overall cost is the sum of both penalty terms and reconstruction loss at each time point t, clone c, population p, respectively

$$loss=\mathop{\sum }_{t}^{T}{\sum }_{c}^{C}{\sum }_{p}^{P}\left(recon+{\sum }_{i=0}^{5}{l}_{i}\right)$$

(6)

From DNA barcodes to meta-clones

According to the experimental protocol, the inheritable DNA barcodes are induced in undifferentiated populations and their immediate progenies at Day 0. Due to the typical challenges in clonal analysis, e.g., cell dropout, barcode homoplasy or loss of barcodes³⁶, although hundreds of unique clones are captured, a large proportion only contains few cells or appears in limited time points and populations. Therefore, analyzing the behavior of individual clones is infeasible.

It is straightforward to use the pooled information, under the assumption that clones with similar kinetics should in turn produce similar cell counts at a specific time point t for a specific population p. Using cell counts at (t_i, p_i) as features, clones with alike characteristics were then clustered together to form meta-clones using the Leiden clustering method from the scanpy package⁵⁴.

Here we give a step-by-step guideline on how the meta-clones were obtained in our experiments:

1. The LARRY system induced a few thousand clones (3940 for the cord blood dataset and 5859 for the mouse hematopoiesis dataset). Clones with 2 cells or fewer were filtered out, leaving 606 and 4214 clones, respectively;
2. calculate the principal components of the filtered time-population-dependent barcode matrix using default values
$${\mathtt{sc.tl.pca()}}$$
3. get the neighbours’ information
$${\mathtt{sc.pp.neighbors()}}$$
4. get the reduced dimension visualization of the matrix, as an indication of the relationships between clones
$${\mathtt{sc.tl.umap()}}$$
5. meta-clones are formed by aggregating individual clones; here we used Leiden clustering as the example
$${\mathtt{sc.tl.leiden()}}$$

Whilst the resolution of Leiden clustering is arbitrary, we generally follow the rule of having at least 15 clones per meta-clone (after QC). Given the sparse nature of the barcoding data, fewer clones in a meta-clone may translate into noisier estimations. The number of meta-clones is a hyper-parameter that can be adjusted to explore the data at different levels of resolution; in our analysis, the default is to set the resolution parameter of Leiden clustering to 1 (Supplementary Note 5).
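The clone filtering and the time-population-dependent barcode matrix used as clustering input can be sketched without any dependencies. The input format (one `(clone_id, time, population)` record per cell), the function name and the toy records are assumptions for illustration; the resulting matrix would then be handed to the scanpy calls listed in the steps above.

```python
from collections import defaultdict

def clone_feature_matrix(cells, min_cells=3):
    """Build a clone x (time, population) count matrix.

    cells     : iterable of (clone_id, time, population) records, one per cell
    min_cells : clones with fewer cells are dropped (clones with 2 cells or
                fewer are filtered out in the text, hence the default of 3)
    """
    counts = defaultdict(lambda: defaultdict(int))
    features = set()
    for clone, time, pop in cells:
        counts[clone][(time, pop)] += 1
        features.add((time, pop))
    features = sorted(features)
    kept = sorted(c for c, row in counts.items()
                  if sum(row.values()) >= min_cells)
    matrix = [[counts[c].get(f, 0) for f in features] for c in kept]
    return kept, features, matrix

# Toy records: clone "A" has 3 cells and is kept, clone "B" has 2 and is dropped.
records = [("A", 3, "prog_1"), ("A", 3, "prog_1"), ("A", 10, "Mono"),
           ("B", 3, "prog_1"), ("B", 10, "Neu")]
clones, features, matrix = clone_feature_matrix(records)
```

Note that the feature columns are defined before filtering, so a (time, population) pair seen only in discarded clones still appears as an all-zero column; whether such columns should be dropped is a design choice left open here.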

Of note, an extra meta-clone named as background was included (BG, including both the barcoded and non-barcoded cells). Meta-clone BG represents the average behavior of the entire system, and the weighted average of the kinetics of all the meta-clones should resemble the dynamic profiles of the meta-clone BG.

Model initialization and training strategies

For the constant mode of CLADES, the Kaiming Uniform⁵⁵ was used as the default to initialize the rates (K_1,2) of the (meta-) clone-specific ODE function,

$$W \sim U\left(-\sqrt{6\,/\,{f}_{in}},\sqrt{6\,/\,{f}_{in}}\right)$$

(7)

where f_in is the dimension of the input layer (number of defined cell populations). For the dynamic mode, since the estimated rates for each meta-clone are given by linear layers with a small input dimension, the weights would be too large if the parameters were initialized by the aforementioned approach, because the bound is inversely related to f_in. Therefore, to avoid problems like exploding gradients or numerical overflow, which might prevent the adaptive solver from solving the system, we instead initialized the MLP weights from a zero-mean normal distribution, N(0, 0.01).
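The two initialization schemes can be compared with a few lines of standard-library Python. The function names are hypothetical, and the second argument of N(0, 0.01) is treated here as a standard deviation, which is an assumption since the text does not say whether it denotes the variance or the standard deviation.

```python
import math
import random

def kaiming_uniform_bound(f_in):
    # Bound of the uniform distribution in Eq. (7).
    return math.sqrt(6.0 / f_in)

def init_constant_rates(f_in, n, rng):
    # Constant mode: draw n rates from U(-bound, bound).
    bound = kaiming_uniform_bound(f_in)
    return [rng.uniform(-bound, bound) for _ in range(n)]

def init_dynamic_weights(n, rng, std=0.01):
    # Dynamic mode: small zero-mean normal weights (std is an assumption).
    return [rng.gauss(0.0, std) for _ in range(n)]

rng = random.Random(0)
w_const = init_constant_rates(f_in=6, n=36, rng=rng)
w_dyn = init_dynamic_weights(n=36, rng=rng)
```

With only a handful of populations the uniform bound is of order 1, far larger than the typical magnitude of the normal draws, which illustrates why the smaller initialization is gentler on the ODE solver.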

The training strategy also differs between synthetic and real-world datasets. For the synthetic dataset, the number of training time points used varies for different simulation trials, whilst they all use the same testing set for evaluation, which allows a direct comparison between trials (Supplementary Table S1 & S2). However, for real-world datasets such as the cord blood or mouse hematopoietic data, the available time points are limited. Therefore, splitting them into training and test sets is not feasible, because that would undermine the model’s ability to learn sufficient patterns, and we used all 3 available time points as the training set.

Nevertheless, the performance of CLADES can still be guaranteed, because 1) the analysis of synthetic data showed that 3~4 input time points are sufficient to obtain fairly accurate predictions, and 2) we use multiple biologically informed penalties to limit over-fitting.

As for the configurations used to train the model: the default number of epochs is 1500 and, for the ODE block of the dynamic mode, the softplus activation function and a hidden dimension size of 32 were used. The AdamW optimizer with default settings was adopted, and the learning rate can be adjusted using either multi-step learning rate decay every 200 epochs with decay rate γ = 0.5, or an automatic scheme based on the general losses. The initial learning rates for the constant mode and dynamic mode are 5e-2 and 1e-3, respectively.
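The multi-step decay described above amounts to halving the learning rate every 200 epochs. The sketch below assumes that the milestones sit at exact multiples of 200; the function name is a placeholder.

```python
def multistep_lr(epoch, lr0, step=200, gamma=0.5):
    # Learning rate after (epoch // step) decay steps of factor gamma.
    return lr0 * gamma ** (epoch // step)

# Learning rates at the start and end of a 1500-epoch run in constant mode.
lr_start = multistep_lr(0, 5e-2)
lr_end = multistep_lr(1499, 5e-2)
```

Over 1500 epochs this schedule applies seven halvings, so the final learning rate is 1/128 of the initial one.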

Bootstrapping for model confidence intervals

As described in the parameter inference section, PoissonNLL was used instead of GaussianNLL because it is more robust to sparse data with lineage relationships. However, we could not directly compute the estimation error or 95% CI based on this loss function.

To obtain these quantities and a comprehensive analysis of the rates, we adopted a bootstrapping strategy, which randomly samples the observed data with replacement M times (the initial time point t₀ is not sampled). Based on the central limit theorem, we obtain the percentile CI for each transition rate and the quantile estimation error for the total cell counts after ranking the fitted parameters ${\hat{\theta }}_{1},{\hat{\theta }}_{2},...,{\hat{\theta }}_{M}$ of all bootstrapping trials.

$${{{{\rm{S}}}}}_{i} \sim {{{\rm{Sample}}}}(\forall {x}_{t,p}\in Obs\setminus {x}_{0,p},\,{{{\rm{replace}}}}=True)\\ {\hat{\theta }}_{i,t} =f({{{{\rm{S}}}}}_{i})\\ {{{\rm{CI}}}} \sim {{{\rm{Quantile}}}}({\hat{\theta }}_{1},{\hat{\theta }}_{2},\ldots,{\hat{\theta }}_{M})\\ s.t.\,i =1,2,\ldots,M$$

(8)
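The resampling loop of Eq. (8) can be illustrated with a generic percentile bootstrap. In the sketch below, `fit` stands in for refitting the model on a resampled dataset (here simply the sample mean), the observations are assumed to already exclude the initial time point, and the quantile is taken by indexing into the sorted estimates; all names and values are hypothetical.

```python
import random

def percentile(sorted_values, q):
    # Pick the entry closest to quantile q in an already sorted list.
    idx = int(round(q * (len(sorted_values) - 1)))
    return sorted_values[max(0, min(len(sorted_values) - 1, idx))]

def bootstrap_ci(observations, fit, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from n_boot resamples with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(observations) for _ in observations]
        estimates.append(fit(sample))
    estimates.sort()
    return percentile(estimates, alpha / 2), percentile(estimates, 1 - alpha / 2)

def mean(values):
    return sum(values) / len(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ci_low, ci_high = bootstrap_ci(data, fit=mean)
```

In practice each call to `fit` would be a full model fit, so the cost grows linearly with the number of bootstrap trials.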

For statistical analysis and the comparison of rates within different meta-clones, we mainly used Student's t-test or the Mann-Whitney U rank test, depending on whether the distribution of bootstrapped values follows a normal distribution or not. The p-values of multiple tests were corrected using the Benjamini-Hochberg procedure (false discovery rate, FDR-BH) with 0.05 as the threshold.
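For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is a generic sketch of the standard procedure, not code from CLADES, and the p-values are made up.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p <= 0.05 for p in adj]
```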

It is worth noting that, when performing the tests, we took precautions to avoid abnormal significance in the bootstrapping results, especially for populations with zero cell counts. Therefore, we introduced the mean absolute change of parameters as an additional metric to assess the significance between a pair of meta-clones (see Model evaluation metrics section). Furthermore, we also deemed as non-significant the rates estimated for cell populations that have fewer than 4 cells at the corresponding time point.

The bootstrapping approach might over-estimate the CI given limited available experimental time points; alternative approaches include increasing sequencing times or profiling the likelihoods^56,57 after fitting the model, at the cost of a complicated experimental design or a linearly increased computational complexity.

Model evaluation metrics

The following evaluation metrics were used to help assess the performance and usage of CLADES, for both synthetic and real-world datasets.

Average recovery rate

We used this to demonstrate the model’s ability to interpolate cell counts on unknown time points (for synthetic datasets) and to assess the model’s performance in reconstructing the original cell counts (for real-world datasets) with the following formula,

$$rat{e}_{t}=\left(1-\frac{mean\left(\sum | {x}_{t,c,p}-{\hat{x}}_{t,c,p}| \right)}{mean\left(\sum {x}_{t,c,p}\right)}\right) * 100$$

(9)

where x_t,c,p and ${\hat{x}}_{t,c,p}$ are the observed and predicted cell counts at any given time points, meta-clones, and populations. Since cell counts from different time points usually have distinctive magnitudes (shown in the scaling factors), this metric is often evaluated on a per-time-point basis to ensure the comparability between different trials.
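Eq. (9) can be evaluated for one time point as follows. The sketch assumes that the observed and predicted counts have been flattened over meta-clones and populations into two equally long lists, so that the ratio of means reduces to a ratio of sums; the function name and numbers are illustrative.

```python
def recovery_rate(observed, predicted):
    """Average recovery rate (in percent) for one time point, as in Eq. (9)."""
    total = sum(observed)
    if total == 0:
        raise ValueError("observed counts sum to zero")
    abs_error = sum(abs(x - x_hat) for x, x_hat in zip(observed, predicted))
    return (1 - abs_error / total) * 100

rate = recovery_rate([10, 20, 30], [12, 18, 30])
```

A perfect reconstruction gives 100, and the value can become negative when the total absolute error exceeds the total observed count.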

Correlation of cell counts

Whilst the average recovery rate focused on the error of each individual data point, this metric is more coarse-grained, which models the general trends of cell counts given a time stamp and a meta-clone

$$cor{r}_{t,c} =\rho ({x}_{t,c},\,{\hat{x}}_{t,c})\\ s.t.\,\,t \, \ne \, {t}_{0}$$

(10)

as shown in the equation, the correlation is based on the cell counts of available populations. Thus, it is reasonable if the correlation is extremely high, as the number of populations is usually not a large number.
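A minimal sketch of Eq. (10) for one time point and one meta-clone follows. The text does not specify which correlation coefficient ρ denotes; the Pearson coefficient is assumed here, and the function name is a placeholder.

```python
import math

def pearson(x, y):
    """Pearson correlation between observed and predicted counts across populations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either vector is constant
    return cov / math.sqrt(var_x * var_y)

corr = pearson([120, 40, 5, 0], [110, 48, 9, 1])
```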

Absolute error of rates

Both evaluation metrics above focus on the accuracy of the predicted cell counts. For each (meta-) clone, however, the number of descendant cells is the product of the orchestrated transition rates between upstream and downstream cell states.

Therefore, it is also useful to investigate the model’s performance in recovering interim parameters.

$$erro{r}_{t,c,p} =| {K}_{t,c,p}-\,{\hat{K}}_{t,c,p}| \\ s.t.\,\,t \in ({t}_{0},{t}_{1},...,{t}_{n})$$

(11)

where K is the transition rate matrix, including both proliferation and differentiation rates.

The absolute error of rates is a more direct indicator of the model’s performance, as the error accumulated in transition rates is sometimes not reflected in the predicted cell counts.

Mean absolute difference of parameters

This is mainly used in the post-processing of the bootstrapping results, together with statistical tests. In some cases, the variance of the distribution of rates given by bootstrapping is small, which leads to a statistically significant difference between meta-clones and might lead to a wrong interpretation when the total cell counts are 0.

Suppose we need to compare a rate between two meta-clones c₁ and c₂. From M bootstrapping trials, we obtain ${K}_{{\theta }_{i},{c}_{1},{p}_{1},{p}_{2}}$ and ${K}_{{\theta }_{i},{c}_{2},{p}_{1},{p}_{2}}$. Then the mean absolute difference of rates is defined as,

$$fc=\frac{\mathop{\sum }_{i=1}^{M}| {K}_{{\theta }_{i},{c}_{1},{p}_{1},{p}_{2}}-{K}_{{\theta }_{i},{c}_{2},{p}_{1},{p}_{2}}| }{M}$$

(12)

and the default threshold for this metric is 0.1, so that any difference below this threshold is considered not significant.
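Eq. (12) and the associated threshold can be sketched as below, assuming the two lists hold the same rate estimated for two meta-clones in matching bootstrap trials; the function names and values are hypothetical.

```python
def mean_abs_difference(rates_c1, rates_c2):
    # Mean absolute difference over paired bootstrap trials, as in Eq. (12).
    return sum(abs(a - b) for a, b in zip(rates_c1, rates_c2)) / len(rates_c1)

def passes_effect_size(rates_c1, rates_c2, threshold=0.1):
    # Differences below the threshold are treated as not significant.
    return mean_abs_difference(rates_c1, rates_c2) >= threshold

small = passes_effect_size([0.50, 0.60, 0.70], [0.45, 0.58, 0.66])
large = passes_effect_size([1.0, 1.2], [0.5, 0.6])
```

Used alongside the statistical tests, this filter discards comparisons that are statistically significant only because the bootstrap distributions are very narrow.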

Using Gillespie to simulate differentiation landscapes

We adopted and modified the original Gillespie algorithm to achieve a balance between accuracy and time complexity (Supplementary Note 3, algorithm 1). As t_cut was introduced in the algorithm design to guarantee convergence (e.g., the time increment between two simulation steps Δt = max(Δt, t_cut)), the cell counts generated by the stochastic simulation algorithm do not perfectly resemble the observed counts, whilst this does not affect the division statistics of a progenitor cell to produce a certain progeny and the likelihood of producing a lineage.

The division summary is delineated in the following way: starting from an early progenitor cell at the initial time point (e.g., HSC/MPP 1 or prog_1); the number of proliferation events was counted until the first progeny was produced. This progeny could either be a later progenitor or a specific cell fate (e.g., prog_2 or Erythroids). In some realizations, there were only differentiation events; thus, the number of divisions was 0.

In order to explore the variety of fates produced by different progenitors, we ran the simulation 1000 times, each time starting with just a single progenitor cell. Any progenitor can be used as the starting cell, but we focused on the cell types that were initially barcoded to recapitulate the observations and gain more insight into which cell type produces the different outputs during the time span of the experiment. Note that we consider a lineage as “produced” if there are at least 40 cells simultaneously in that lineage at any point of the simulation.

The original Gillespie algorithm sometimes falls into an infinite loop of choosing the same reaction, due to the extremely small time increments. Though the introduction of t_cut does not affect the analysis mentioned above, we lose other information that the Gillespie algorithm could provide, e.g., analysis of cell count dynamics and statistics of reactions.
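The role of t_cut can be illustrated with a stripped-down stochastic simulation. This is not Algorithm 1 from Supplementary Note 3: the sketch only models proliferation and differentiation events (no apoptosis), assumes constant per-cell rates per day, and adds a `max_cells` guard to keep the toy run bounded; all names and rates are hypothetical.

```python
import math
import random

def gillespie(counts, prolif, diff, t_end, t_cut=1e-4, seed=0, max_cells=10_000):
    """Simulate cell counts with time increments floored at t_cut.

    counts : dict population -> initial number of cells
    prolif : dict population -> proliferation rate per cell
    diff   : dict (source, target) -> differentiation rate per cell
    """
    rng = random.Random(seed)
    counts = dict(counts)
    t = 0.0
    while t < t_end and sum(counts.values()) < max_cells:
        events = [(("prolif", p, p), r * counts[p]) for p, r in prolif.items()]
        events += [(("diff", s, d), r * counts[s]) for (s, d), r in diff.items()]
        events = [e for e in events if e[1] > 0]  # keep only possible events
        total = sum(a for _, a in events)
        if total <= 0:
            break
        dt = -math.log(1.0 - rng.random()) / total  # exponential waiting time
        t += max(dt, t_cut)                         # floor the increment at t_cut
        if t > t_end:
            break
        u = rng.random() * total
        acc = 0.0
        for (kind, src, dst), a in events:          # pick event by propensity
            acc += a
            if u < acc:
                break
        if kind == "prolif":
            counts[src] += 1
        else:
            counts[src] -= 1
            counts[dst] += 1
    return counts

# One progenitor that can only differentiate ends up as one terminal cell.
final = gillespie({"prog": 1, "fate": 0}, {"prog": 0.0},
                  {("prog", "fate"): 5.0}, t_end=100.0)
```

Because the increment is floored, simulated time advances faster than in the exact algorithm whenever the total propensity is very large, which is consistent with the statement above that the simulated counts do not perfectly resemble the observed ones.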

Experimental procedures for the LARRY human cord blood dataset

1) LARRY barcoding plasmid propagation and analysis of barcode library diversity using 10X. The LARRY Barcode Version 1 library was a gift from Fernando Camargo (Addgene, #140024). The library was amplified and the lentiviral vector produced according to the associated published protocol²² with some minor adaptations. Briefly, plasmids were introduced into ElectroMAX Stbl4 competent cells (Life Technologies) using a MicroPulser Electroporator (Bio-Rad) and incubated for 1 h at 37 °C before spreading over 24 large Agar + Ampicillin plates. After 24 h at 32 °C, colonies were harvested through scraping using pre-warmed LB medium containing Ampicillin. The resulting culture (approx. 1.5 L) was incubated at 37 °C for 2 h before isolating plasmid DNA using Megaprep kits (Macherey-Nagel).

A reference library was made through sequencing of PCR-amplified barcodes from the LARRY plasmid library. 10 ng of LARRY plasmids was taken as input for a two-step PCR; the first step adds Illumina Read1 and Read2 sequences (5’ACACTCTTTCCCTACACGACGCTCTTCCGATCTTGTGACGTCACAGGTCGACACCAGTCTCATT3’ and 5’GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCGAGTAACCGTTGCTAGGAGAGACCATA3’). The second step adds the P5 and P7 flow cell attachment sequences and 7 bp sample indices (P5 5’AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT3’ and P7 5’CAAGCAGAAGACGGCATACGAGANNNNNNNGTGACTGGAGTTCAGACGTGCTCTTCCGATC3’). 8 ng of PCR1 product was taken as input for PCR2. The PCR program used was as follows: 98 °C for 2 min; eight cycles of 98 °C for 10 s, 58 °C (PCR1) or 62.5 °C (PCR2) for 20 s, and 72 °C for 30 s; followed by a final elongation at 72 °C for 5 min and a hold at 4 °C. In between PCR1 and PCR2, PCR purification was performed using the QIAquick PCR purification kit (Qiagen). Purification of PCR2 product was carried out using Ampure XP beads (Beckman Coulter) before sequencing on a NovaSeq instrument at the Cancer Research UK Cambridge Institute Genomics facility. The resulting list of LARRY barcodes was used as a reference list in downstream analysis.

2) LARRY library lentivirus preparation. Lentiviral vector was produced by transfecting the amplified LARRY and packaging plasmids (psPAX2 and pMD2.G) into HEK293T cells using the TransIT-LT1 transfection reagent (Mirus), followed by incubation at 37 °C. Twenty-four hours after transfection, 500 μl of 0.5 mM Sodium Butyrate (Sigma-Aldrich) was added to the cells and the culture was continued at 37 °C. LARRY lentivirus was harvested 48 h after transfection, filtered through a 0.45 μm PES filter (Whatman) and concentrated 100-fold by centrifugation in a Beckman Coulter Optima XPM-80 ultracentrifuge at 20,000 x g for 2 h at 4 °C. The lentivirus was aliquoted and stored at −80 °C. Titration of the lentiviral barcoding library was performed on HEK293T cells, with a readout obtained by flow cytometry 3 days after infection.

3) Cord blood samples. Umbilical CB samples were obtained with informed consent from healthy donors by Cambridge Blood and Stem Cell Biobank (CBSB) in accordance with regulated procedures approved by the relevant Research and Ethics Committees (18/EE/0199 and 24/EE/0116 Research Studies). MNCs were obtained using Pancoll density gradient centrifugation of diluted (1:1 with PBS) CB. Red cells were lysed before positive selection for CD34⁺ cells using the micro beads CD34⁺ selection kit and AutoMACS cell separation technology (Miltenyi Biotec). CB CD34⁺ cells were then stored at −150 °C until use in experiments.

4) FACS. To sort CD34⁺ cells for experiments from CB samples, the cells were thawed by dropwise addition of pre-warmed Rich Thawing Medium containing Iscove’s Modified Dulbecco’s Medium (IMDM, Life Technologies), 0.1 mg/ml DNase (Lorne Laboratories) and 50% Fetal Bovine Serum (FBS, Life Technologies), followed by centrifugation (500 x g, 5 min) and resuspension in PBS + 3% FBS. The samples were stained with antibodies (Table 1).

Table 1 Antibody and sample statistics used in the experiments

Cells were incubated for 20 min at room temperature and washed with PBS + 3% FBS. DAPI (BioLegend) was added at a 1:100 dilution (final concentration 137 μg/ml) to identify viable cells. Unstained cells and compensation beads (Invitrogen) were used for compensation and as controls to set appropriate gates. Lineage⁻ CD11c⁻ CD34⁺ cells were sorted into Eppendorf tubes using a BD Influx approved for CL2-sorting.

5) In vitro barcoding, culture and sampling for 10X. The cord blood CD34⁺ cells were cultured following the reference conditions⁵⁸ (Tables 2 and 3), which promote the differentiation of cells towards Megakaryocyte, Erythroid and Myeloid lineages (MEM), with reduced levels of the cytokine EPO to avoid biasing the culture towards the erythroid lineage.

Table 2 Conditions for culturing the CD34⁺ cells

Table 3 Conditions for culturing the CD34⁺ cells (cont'd)

For barcode labeling, 75,000 sorted Lineage⁻ CD11c⁻ CD34⁺ cells were seeded at 5000 cells per well in a 96-well round-bottom plate. LARRY lentivirus was added directly to the culture at an MOI of 60. After 24 h, the virus was diluted out and the cells were transferred to a 96-well flat-bottom plate. At 3 days post-transduction, GFP⁺ and GFP⁻ cells were FACS-sorted. Two thirds of the GFP⁺ fraction (8856 cells) were sent for scRNA-seq analysis (10X Genomics) and the remaining third (4428 cells) was re-plated in 500 μl reduced EPO MEM culture conditions in a 24-well plate. GFP⁻ cells were also analyzed by scRNA-seq. At day 10 post-transduction, cells were stained for GlyA (PE, 1:1000 dilution, BD) using the protocol above. GFP⁺ GlyA⁻ cells were FACS-sorted, with 40,000 cells processed for scRNA-seq and the remainder replated as before. At day 17 post-transduction, 40,000 GFP^hi GlyA⁻ and 40,000 GFP^mid GlyA⁻ cells were FACS-sorted and processed for scRNA-seq.
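As a quick check of the numbers in this paragraph, the following Python sketch reproduces the plate and split arithmetic. It assumes that the MOI is expressed as transducing units (TU) per cell, which is the usual convention but is not stated explicitly here; all variable names are illustrative.

```python
from fractions import Fraction

# Values taken from the barcoding paragraph above.
sorted_cells = 75_000      # sorted Lineage- CD11c- CD34+ cells
cells_per_well = 5_000     # seeding density in the 96-well plate
moi = 60                   # assumed to mean transducing units (TU) per cell

# Number of wells needed and viral dose per well under the TU-per-cell assumption.
wells = sorted_cells // cells_per_well
tu_per_well = moi * cells_per_well

# Day 3 split of the GFP+ fraction.
gfp_sequenced = 8856       # sent for scRNA-seq
gfp_replated = 4428        # returned to culture
gfp_total = gfp_sequenced + gfp_replated
sequenced_fraction = Fraction(gfp_sequenced, gfp_total)

if __name__ == "__main__":
    print(f"wells seeded: {wells}")
    print(f"TU per well at MOI {moi}: {tu_per_well}")
    print(f"GFP+ cells sorted at day 3: {gfp_total}")
    print(f"fraction sent for scRNA-seq: {sequenced_fraction}")
```

The reported counts are internally consistent: 8856 and 4428 sum to 13,284 GFP⁺ cells, of which exactly two thirds were sequenced.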

6) Sample preparation and LARRY cDNA enrichment for scRNAseq (10X Genomics). Up to 20,000 live cells of interest were sorted into 300 μl MEM media and kept on ice before centrifugation. Cells were resuspended in PBS + 0.04% BSA (Miltenyi) and further processed for single cell sequencing using the Chromium Single Cell 3’ Library & Gel Bead Kit v3 (10X Genomics) following the manufacturer’s protocols. LARRY molecules constitute a small fraction of the total cDNA library; thus, we adapted previously published protocols^22,59 to enrich the LARRY barcodes and ensure a sufficient number of reads per barcode. Using a portion of each sample’s cDNA, LARRY sequences were PCR-amplified while simultaneously adding the Illumina primers and indices necessary for sequencing and for identifying the enriched samples in downstream analysis (Supplementary Table S9). The PCR product was purified using AMPure XP beads (Beckman Coulter) before sequencing alongside the full sample cDNA library on a NovaSeq 6000 at the Cancer Research UK Cambridge Institute Genomics facility.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

CLADES used two public datasets to demonstrate its applicability and downstream functions: 1) a newly generated LARRY-based human cord blood LT-scSeq dataset, for which the raw data are deposited in GEO under accession number GSE276896. The scripts for pre-processing the data can be found here. The processed adata and the model input files can be accessed via https://figshare.com/articles/dataset/CLADES/27908142 (DOI https://doi.org/10.6084/m9.figshare.27908142); 2) a publicly available mouse hematopoiesis dataset^22,36, for which we used the same data as described in the literature but with a more fine-grained annotation of cell populations. The relevant data can be accessed via the same figshare repository. All synthetic datasets used in this project are also deposited at https://figshare.com/articles/dataset/CLADES/27908142. The scripts for generating the Supplementary Figs. can be found in several notebooks within the GitHub repository https://github.com/StatBiomed/clonaltrans, covering both data preprocessing and downstream analysis.

Code availability

CLADES is a Python package and is publicly available at https://github.com/StatBiomed/clonaltrans with both constant and dynamic modes implemented. It also contains documentation, pipelines, and Jupyter Notebooks to reproduce figures and results mentioned in this paper. The specific version used for this manuscript is v1.2.0, with Zenodo DOI identifier https://doi.org/10.5281/zenodo.15811408⁶⁰.

References

1. Kester, L. & Van Oudenaarden, A. Single-cell transcriptomics meets lineage tracing. Cell Stem Cell 23, 166–179 (2018).
2. Woodworth, M. B., Girskis, K. M. & Walsh, C. A. Building a lineage from single cells: genetic techniques for cell lineage tracking. Nat. Rev. Genet. 18, 230–244 (2017).
3. Wagner, D. E. & Klein, A. M. Lineage tracing meets single-cell omics: opportunities and challenges. Nat. Rev. Genet. 21, 410–427 (2020).
4. Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
5. Setty, M. et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).
6. Qiu, X. et al. Mapping transcriptomic vector fields of single cells. Cell 185, 690–711 (2022).
7. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
8. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
9. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
10. Bergen, V., Soldatov, R. A., Kharchenko, P. V. & Theis, F. J. RNA velocity – current challenges and future perspectives. Mol. Syst. Biol. 17, e10282 (2021).
11. Gao, M., Qiao, C. & Huang, Y. UniTVelo: temporally unified RNA velocity reinforces single-cell trajectory inference. Nat. Commun. 13, 6586 (2022).
12. Erhard, F. et al. Time-resolved single-cell RNA-seq using metabolic RNA labelling. Nat. Rev. Methods Prim. 2, 77 (2022).
13. Qiu, Q. et al. Massively parallel and time-resolved RNA sequencing in single cells with scNT-seq. Nat. Methods 17, 991–1001 (2020).
14. Maizels, R. J., Snell, D. M. & Briscoe, J. Reconstructing developmental trajectories using latent dynamical systems and time-resolved transcriptomics. Cell Syst. 15, 411–424 (2024).
15. Chen, C., Liao, Y. & Peng, G. Connecting past and present: single-cell lineage tracing. Protein Cell 13, 790–807 (2022).
16. Ludwig, L. S. et al. Lineage tracing in humans enabled by mitochondrial mutations and single-cell genomics. Cell 176, 1325–1339 (2019).
17. Kwok, A. W. C. et al. MQuad enables clonal substructure discovery using single cell mitochondrial variants. Nat. Commun. 13, 1205 (2022).
18. Huang, R. et al. Robust analysis of allele-specific copy number alterations from scRNA-seq data with XClone. bioRxiv 15, 6684 (2023).
19. Xue, Y., Su, Z., Lin, X., Ho, M. K. & Yu, K. H. Single-cell lineage tracing with endogenous markers. Biophys. Rev. 16, 125–139 (2024).
20. Chung, H. M. & Huang, Y. Interpretable variational encoding of genotypes identifies comprehensive clonality and lineages in single cells geometrically. bioRxiv 2024–07 (2024).
21. Guo, C. et al. CellTag indexing: genetic barcode-based sample multiplexing for single-cell genomics. Genome Biol. 20, 1–13 (2019).
22. Weinreb, C., Rodriguez-Fraticelli, A., Camargo, F. D. & Klein, A. M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, eaaw3381 (2020).
23. Sashittal, P., Schmidt, H., Chan, M. & Raphael, B. J. Startle: a star homoplasy approach for CRISPR-Cas9 lineage tracing. Cell Syst. 14, 1113–1121 (2023).
24. Spanjaard, B. et al. Simultaneous lineage tracing and cell-type identification using CRISPR–Cas9–induced genetic scars. Nat. Biotechnol. 36, 469–473 (2018).
25. Lu, R., Neff, N. F., Quake, S. R. & Weissman, I. L. Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nat. Biotechnol. 29, 928–933 (2011).
26. Naik, S. H. et al. Diverse and heritable lineage imprinting of early haematopoietic progenitors. Nature 496, 229–232 (2013).
27. Bandler, R. C. et al. Single-cell delineation of lineage and genetic identity in the mouse brain. Nature 601, 404–409 (2022).
28. Ratz, M. et al. Clonal relations in the mouse brain revealed by single-cell and spatial transcriptomics. Nat. Neurosci. 25, 285–294 (2022).
29. Biddy, B. A. et al. Single-cell mapping of lineage and identity in direct reprogramming. Nature 564, 219–224 (2018).
30. Rodriguez-Fraticelli, A. E. et al. Single-cell lineage tracing unveils a role for Tcf15 in haematopoiesis. Nature 583, 585–589 (2020).
31. Merino, D. et al. Barcoding reveals complex clonal behavior in patient-derived xenografts of metastatic triple negative breast cancer. Nat. Commun. 10, 766 (2019).
32. Rodriguez-Meira, A. et al. Single-cell multi-omics identifies chronic inflammation as a driver of TP53-mutant leukemic evolution. Nat. Genet. 55, 1531–1541 (2023).
33. Fischer, D. S. et al. Inferring population dynamics from single-cell RNA-sequencing time series data. Nat. Biotechnol. 37, 461–468 (2019).
34. Forrow, A. & Schiebinger, G. LineageOT is a unified framework for lineage tracing and trajectory inference. Nat. Commun. 12, 4940 (2021).
35. Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
36. Wang, S.-W., Herriges, M. J., Hurley, K., Kotton, D. N. & Klein, A. M. CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat. Biotechnol. 40, 1066–1074 (2022).
37. Kucinski, I. et al. A time-and single-cell-resolved model of murine bone marrow hematopoiesis. Cell Stem Cell 31, 244–259 (2024).
38. Tong, A. et al. Learning transcriptional and regulatory dynamics driving cancer cell plasticity using neural ODE-based optimal transport. bioRxiv 2023-03 (2023).
39. Wojtowicz, E. E. et al. Panhematopoietic RNA barcoding enables kinetic measurements of nucleate and anucleate lineages and the activation of myeloid clones following acute platelet depletion. Genome Biol. 24, 152 (2023).
40. Wang, S.-W. MethylTree: exploring epimutations for accurate and non-invasive lineage tracing. Nat. Methods 22, 463–464 (2025).
41. Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31, 1–13 (2018).
42. Lu, T., Volfson, D., Tsimring, L. & Hasty, J. Cellular growth and division in the Gillespie algorithm. Syst. Biol. 1, 121–128 (2004).
43. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology-preserving map of single cells. Genome Biol. 20, 1–9 (2019).
44. Waldherr, S. Estimation methods for heterogeneous cell population models in systems biology. J. R. Soc. Interface 15, 20180530 (2018).
45. Charlebois, D. A. & Balázsi, G. Modeling cell population dynamics. In Silico Biol. 13, 21–39 (2019).
46. Stumpf, P. S. et al. Stem cell differentiation as a non-Markov stochastic process. Cell Syst. 5, 268–282 (2017).
47. Takahashi, M. et al. Reconciling flux experiments for quantitative modeling of normal and malignant hematopoietic stem/progenitor dynamics. Stem Cell Rep. 16, 741–753 (2021).
48. Bowling, S. et al. An engineered CRISPR-Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell 181, 1410–1422 (2020).
49. Miller, T. E. et al. Mitochondrial variant enrichment from high-throughput single-cell RNA sequencing resolves clonal populations. Nat. Biotechnol. 40, 1030–1034 (2022).
50. Marjanovic, N. D. et al. Emergence of a high-plasticity cell state during lung cancer evolution. Cancer Cell 38, 229–246 (2020).
51. Popescu, D.-M. et al. Decoding human fetal liver haematopoiesis. Nature 574, 365–371 (2019).
52. Orfao, A. & Ruiz-Argüelles, A. General concepts about cell sorting techniques. Clin. Biochem. 29, 5–9 (1996).
53. Kim, S., Ji, W., Deng, S., Ma, Y. & Rackauckas, C. Stiff neural ordinary differential equations. Chaos Interdiscip. J. Nonlinear Sci. 31, 093122 (2021).
54. Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
55. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proc. IEEE International Conference on Computer Vision, 1026–1034 (IEEE, 2015).
56. Meeker, W. Q. & Escobar, L. A. Teaching about approximate confidence regions based on maximum likelihood estimation. Am. Stat. 49, 48–53 (1995).
57. Raue, A. et al. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929 (2009).
58. Belluschi, S. et al. Myelo-lymphoid lineage restriction occurs in the human haematopoietic stem cell compartment before lymphoid-primed multipotent progenitors. Nat. Commun. 9, 4100 (2018).
59. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
60. Gao, M. et al. A hybrid NeuralODE-Gillespie approach for unveiling clonal cell fate and differentiation dynamics. StatBiomed clonaltrans https://doi.org/10.5281/zenodo.15811408 (2025).

Acknowledgements

We kindly thank the cord blood donors for their tissue donation, and Dr. Joanna Baxter and the team of the Cambridge Blood and Stem Cell Biobank for consenting and collecting cord blood samples. We also thank Reiner Schulte and Gabriela Grondys-Kotarba at the Cambridge Institute for Medical Research Flow Cytometry Core Facility for flow cytometry cell sorting, and the CRUK Cambridge Institute Genomics Center for sequencing. This project is supported by the National Natural Science Foundation of China (No. 62222217), the University of Hong Kong through a startup fund and a seed fund (Y.H.), and the InnoHK initiative of the Innovation and Technology Commission of the Hong Kong Special Administrative Region Government. This project is also funded by Wellcome (215116/Z/18/Z; E.L. and B.G.) and previous core support grants from Wellcome and the Medical Research Council (MRC) to the Wellcome-MRC Cambridge Stem Cell Institute (203151/Z/16/Z). E.L. was supported by a Wellcome - Royal Society Sir Henry Dale Fellowship (107630/Z/15/Z). M.H. is supported by a Sir Henry Wellcome Postdoctoral Fellowship (224055/Z/21/Z). M.B. is also funded in part by the Wellcome Trust [226795/Z/22/Z]. This research was funded in whole, or in part, by the Wellcome Trust. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

Author information

These authors contributed equally: Mingze Gao, Melania Barile.

Authors and Affiliations

School of Biomedical Sciences, The University of Hong Kong, Hong Kong SAR, China
Mingze Gao, Yiming Chao & Yuanhua Huang
InnoHK-Centre for Translational Stem Cell Biology, Hong Kong Science Park, Hong Kong SAR, China
Melania Barile, Yiming Chao, Berthold Göttgens & Yuanhua Huang
Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
Melania Barile, Shirom Chabra, Myriam Haltalli, Emily F. Calderbank, Weizhong Zheng, Nicola K. Wilson, Elisa Laurenti & Berthold Göttgens
Department of Haematology, University of Cambridge, Cambridge, UK
Melania Barile, Shirom Chabra, Myriam Haltalli, Emily F. Calderbank, Weizhong Zheng, Nicola K. Wilson, Elisa Laurenti & Berthold Göttgens
Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
Yuanhua Huang

Contributions

B.G., M.B., E.L., and Y.H. conceived the project. M.G. developed the algorithm, implemented the package CLADES, and analyzed the data with Y.H., M.B., and W.Z.’s support. S.C., M.H., and E.C. generated the cord blood dataset. M.B. processed the data with S.C.’s and Y.C.’s support. M.G., M.B., and Y.H. wrote the manuscript with inputs from all authors. N.W. and E.L. edited the manuscript and contributed to the biological interpretation of the data.

Corresponding authors

Correspondence to Elisa Laurenti, Berthold Göttgens or Yuanhua Huang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Shou-Wen Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gao, M., Barile, M., Chabra, S. et al. CLADES: a hybrid NeuralODE-Gillespie approach for unveiling clonal cell fate and differentiation dynamics. Nat Commun 16, 8174 (2025). https://doi.org/10.1038/s41467-025-63150-6

Received: 23 July 2024
Accepted: 11 August 2025
Published: 01 September 2025
Version of record: 01 September 2025
DOI: https://doi.org/10.1038/s41467-025-63150-6
