Introduction

Bacterial growth dynamics is a fundamental phenomenon of living systems1,2,3, which typically follows a temporal progression of lag, exponential, and stationary phases4,5. Theoretical models (e.g., Gompertz, Logistic)6,7,8 and computational tools9,10,11 were commonly used to discover bacterial growth laws and underlying mechanisms2,12,13. To quantitatively evaluate bacterial growth dynamics under different nutritional conditions and across different genetic backgrounds, single growth parameters were employed, i.e., lag time (t), growth rate (r), and carrying capacity (K), corresponding to the three growth phases14,15.

However, the growth curve, representing the temporal changes in population size, captures a comprehensive phenotype of a living system that cannot be fully described by just three growth parameters8,16. Most studies of growth curves have relied on theoretical analyses and simulations17,18,19, typically assuming S-shaped curves. Nevertheless, multiphasic or irregularly shaped growth curves are frequently observed under environmental changes, stress, or when mixed substrates are present20,21. For example, bacteria exhibit diauxic growth on glucose-lactose media22,23, demonstrating that the shape of the growth curve reflects phase-regulation signals (e.g., CRP-cAMP regulation)12,24 and survival strategies25,26. Genetic and environmental variations in bacterial populations result in significant diversity in growth curves27,28, highlighting the need for direct analysis of entire growth curves rather than relying solely on single growth parameters.

Direct analyses of bacterial growth dynamics have been limited by a lack of datasets containing growth curves associated with both genetic and environmental variations, as well as by computational challenges in analyzing entire growth patterns rather than single parameters. Systematic growth assays involving genome reduction and medium alteration have successfully linked growth parameters to genetic and environmental changes14,15,29. These studies demonstrated not only the establishment of high-throughput experimental approaches but also the feasibility of linking growth curves to genetic and environmental diversity. Computational algorithms for time-series analysis, such as dynamic time warping (DTW)30,31, can be applied to bacterial growth curves. DTW evaluates the similarity between two time series based on their overall shape through elastic alignment, rather than simple pointwise differences32. It has been reported that thousands of bacterial growth curves can be clustered into groups according to their shapes using DTW33, indicating that DTW is a practical method for evaluating bacterial population dynamics.

In this study, we addressed these challenges by applying data mining techniques to link genomic and environmental variations with bacterial population dynamics. Previous work has shown that bacterial growth rates decrease significantly with genome reduction14, change in response to medium variation15,29, and are epistatically influenced by interactions between genetic and environmental factors34,35. Here, we investigated how such genomic and environmental variations influence bacterial population dynamics by analyzing growth curves obtained from strains with varied genomes grown in a range of media. Comprehensive data mining was performed to elucidate the respective impacts of genetic and environmental factors on bacterial growth dynamics and to clarify the principles underlying their interplay.

Results

Bacterial growth dynamics of genomic and environmental variations

A time-series dataset of bacterial growth curves, i.e., temporal changes in cell concentration represented by optical density (98 time points per curve), was obtained under well-controlled laboratory conditions. Three growth parameters, i.e., lag time (t), growth rate (r), and carrying capacity (K), could be calculated from the growth curves to represent the three growth phases (Fig. 1A). To evaluate the genomic impacts on bacterial growth dynamics, five Escherichia coli (E. coli) strains with different genome sizes, roughly corresponding to the number of genes, were included, i.e., the wild-type genome W3110 and its derivatives carrying reduced genomes of varied sizes (Fig. 1B). Although genetic variation is more often studied as mutations or gene deletions, reduced genomes were adopted here because the present study aimed to find a quantitative relationship between the genetic factor and bacterial growth. To identify the environmental impacts, 29 growth media were used (Fig. S1, Table S1). These media were based on a minimal medium, which consisted of seven pure compounds (Table S1). Using a limited number of medium components allowed for a comparable variety of genetic and environmental variables. All media comprised the same eight chemical components but varied in their concentrations on a logarithmic scale (Figs. 1C and S1). Six biological replicates were performed for each condition (per genome per medium), leading to a total of 870 growth curves (5 strains × 29 media × 6 replicates; Figs. 1D and S2, Table S2). As a pilot study, this dataset linking the genome size and chemical components to bacterial growth curves enabled the investigation into how bacterial population dynamics were impacted by genomic and environmental variation.

Fig. 1

Bacterial growth dynamics of varied genome sizes and media. (A) Schematic drawing of bacterial growth curve. The three growth parameters, lag (t), exponential rate (r), and carrying capacity (K), are indicated in the three growth phases. (B) Genomic variation. Five E. coli strains are indicated as N0, N7, N14, N20, and N28, of which the genome sizes are indicated in the colored circles. (C) Environmental variation. Eight chemical components are shown with their concentrations in the 29 media. Open circles and red lines represent the individual and the mean concentrations of each chemical component, respectively. (D) Growth curves. A total of 870 growth curves, temporal changes in OD600, are shown, in which color variation indicates the five E. coli strains, with respect to B.

Decision-making factors for bacterial growth parameters

To evaluate the genomic and environmental impact on different growth phases (Fig. 1A), the three parameters, i.e., lag time (t), growth rate (r), and carrying capacity (K), were calculated from individual growth curves (Table S3). The machine learning algorithm gradient boosting decision tree (GBDT) was used to estimate the importance of the genome size and the chemical concentrations to the three parameters. The results showed that the genome size ranked as the most influential factor for all three parameters (Fig. 2A), indicating that the genomic information played a determinant role in all three growth phases. The SHAP (SHapley Additive exPlanations) evaluation34,36 was also used to interpret the output of the GBDT model prediction. The results further confirmed that genome size was the most important factor influencing the three growth parameters (Fig. 2B). Both positive and negative contributions of genome size to the growth parameters were identified, that is, genome reduction caused decreases in r and K but an increase in t (Figs. 2B and S3A). This was biologically reasonable if genome reduction is regarded as damage to E. coli, resulting in delayed and slow growth with poor resource utilization. The secondary influencers differed among the growth phases, e.g., glucose for K and t, and PO43− for r (Fig. 2A). The order of chemical impacts on growth parameters differed slightly between GBDT and SHAP evaluations (Fig. 2B), likely due to the considerable variation in growth parameters across all media (Fig. S3B). This suggests that all growth parameters are sensitive to chemical components, which act as nutritional resources or regulatory signals. Therefore, relying on single parameters may fail to capture the full dynamic characteristics of bacterial growth.

Fig. 2

Genomic and environmental contributions to bacterial growth parameters. (A) Feature importance of genomic and environmental factors. The feature importance of genome size and chemical components was predicted using GBDT, which was repeated five times with different random seeds. The standard errors of the five repeated predictions are indicated. (B) SHAP evaluation. Each point denotes the SHAP value of a feature for an individual sample. The color gradation from red to blue indicates the feature value from high to low. The top, middle, and bottom panels show the three growth parameters, growth rate (r), carrying capacity (K), and lag time (t), respectively.

Categorization of bacterial growth dynamics according to the curve shape

All 870 growth curves (Fig. 3A, left) were subjected to the time series analysis using dynamic time warping (DTW) to evaluate the similarity of the shapes of the growth curves (Fig. 3A, middle). Since DTW compares the overall temporal dynamics of two time series while allowing for non-linear alignment along the time axis, it effectively fits various growth curves across different experimental and recording time scales. Growth curves with similar shapes, such as lag, exponential, and stationary patterns, can be identified as similar even if the onset or duration of growth phases is shifted. According to the DTW distances, the growth curves were subsequently categorized into varied clusters with multiple clustering methods and statistical metrics (Fig. 3A, middle). The genomic and environmental impacts on the clusters were finally evaluated as to growth parameters (Fig. 3A, right).

Fig. 3

Clustering of growth curves based on shape. (A) Flowchart of comprehensive data mining. The process of data collection, DTW calculation, Clustering, and Evaluation is illustrated. (B) Heatmap of the DTW distance matrix computed from 870 growth curves. Both axes represent individual curves. The color scale from red to light blue indicates pairwise DTW distances from high to low. (C) Line plots of Z-scores versus the number of clusters. Color variation shows the four clustering methods: spectral clustering, K-means clustering, hierarchical clustering, and GMM. Shaded bands depict 95% confidence intervals estimated from three evaluation metrics: Silhouette Score, Calinski–Harabasz Index, and Davies–Bouldin Index.

Pairwise DTW distances were calculated for all 870 growth curves, each comprising 98 time points (Tables S1 and S4). Overall, the growth curves exhibited low similarity, with no clear clustering structure or regular distribution (Fig. 3B). To ensure reliable categorization and avoid algorithmic bias, four clustering methods were tested: hierarchical clustering, K-means clustering, spectral clustering, and Gaussian mixture model (GMM). To statistically evaluate these methods, three metrics were used: silhouette score, Davies–Bouldin index (DBI), and Calinski–Harabasz (CH) index. Since these metrics sometimes yielded conflicting evaluations (Fig. S4), Z-score normalization was applied to create a comprehensive clustering quality index. An exhaustive Z-score analysis was performed for cluster numbers ranging from 2 to 50 (Table S5). The results showed that spectral clustering performance decreased as the number of clusters increased, while the other three methods maintained relatively consistent performance (Fig. 3C). Across all four methods, the best clustering was achieved with two clusters, indicating that the growth curves could be broadly divided into two dynamical patterns, regardless of genomic and environmental variations.
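The combination of conflicting metrics into one index can be illustrated with a minimal sketch. It assumes, as one plausible reading of the Z-score normalization described above, that each metric is standardized across the candidate cluster numbers, that the Davies–Bouldin index is negated because lower values indicate better clustering, and that the three standardized scores are averaged. The function names and the metric values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to zero mean and unit variance."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def composite_quality(silhouette, calinski_harabasz, davies_bouldin):
    """Average the z-scored metrics for each candidate cluster number.

    Silhouette and Calinski-Harabasz are 'higher is better'; the
    Davies-Bouldin index is 'lower is better', so its z-score is negated.
    """
    z_sil = zscores(silhouette)
    z_ch = zscores(calinski_harabasz)
    z_dbi = [-z for z in zscores(davies_bouldin)]
    return [mean(triple) for triple in zip(z_sil, z_ch, z_dbi)]

# Hypothetical metric values for k = 2, 3, 4 (illustrative only).
ks = [2, 3, 4]
sil = [0.60, 0.45, 0.40]
ch = [900.0, 700.0, 650.0]
dbi = [0.70, 0.95, 1.10]

quality = composite_quality(sil, ch, dbi)
best_k = ks[quality.index(max(quality))]
```

Averaging standardized scores places the three metrics on a common scale, so no single metric dominates because of its numerical range.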

Decision-making factors for bacterial growth dynamics

The contributions of the chemical components and genome size to two dynamical patterns were predicted. The GBDT models were trained using the clusters of growth curves as the target variable. Although all four methods resulted in two clusters with the best statistical significance, a few growth curves in the two clusters were different (Table S6). The model training and prediction were performed in response to the four clustering methods separately. Interestingly, the results showed that chemical components, rather than genome size, were the primary determinants of the growth curve clusters (Fig. 4A). Glucose, sulfate (SO42−), and phosphate (PO43−) emerged as key factors. SHAP evaluation further confirmed that glucose was consistently the most important determinant for growth patterns across clustering methods (Fig. 4B). These findings suggest that chemical components are the primary drivers of bacterial growth dynamics. The strong impact of glucose is biologically reasonable, given that it was the sole carbon source in this study (Fig. 1C, Table S1). However, it was unexpected that genome size had a lesser impact on bacterial population dynamics, despite its central biological role.

Fig. 4

Genomic and environmental contributions to bacterial growth dynamics. (A) Feature importance of genomic and environmental factors. GBDT classifiers were trained to predict two clusters of growth curves. The feature importance of genome size and chemical components was predicted using GBDT, which was conducted five times by randomly splitting the data into five folds. The standard errors of five repeated predictions based on five-fold cross-validation are indicated. (B) SHAP evaluation. Each point denotes the SHAP value of a feature for an individual growth curve. The color gradation from red to blue indicates the feature value from high to low. Four panels show the four clustering methods: spectral clustering, K-means clustering, hierarchical clustering, and GMM.

To evaluate how effectively the growth curves were separated into two clusters, multidimensional scaling (MDS) embedding of the DTW distances was used for dimensionality reduction. With spectral clustering, all 870 growth curves were clearly divided into two distinct clusters without overlaps (Fig. 5A). The corresponding DTW barycenter averaging (DBA) prototype curves highlighted the differences in average growth dynamics (Fig. 5B). Further analysis using K-means clustering, which showed the second-best clustering performance, divided the growth curves into four clusters (Table S5). The large cluster (Fig. 5A, Cluster 0) was subdivided into three more clusters with smooth boundaries (Fig. 5C). The DBA prototype curves for these separated clusters demonstrated notable differences in growth dynamics (Fig. 5D). Growth curves were assigned to clusters determined by spectral clustering (two clusters; Fig. S5) and K-means clustering (four clusters; Fig. S6). The distribution of growth curves across these clusters revealed that some groups corresponded to specific genome sizes, such as N14 and N20 (Fig. 5E). This suggests that increasing the number of clusters may help to identify the influence of genomic factors on growth dynamics. In-depth visualization of the cluster memberships assigned by the four methods showed that the cluster structures remained consistent across the methods, with only a few differences among clusters (Fig. S7), although finer partitions were likely more sensitive to the choice of clustering method. While dividing bacterial growth into two clusters is statistically justified, finer clustering could yield more meaningful biological insights.

Fig. 5

Growth curves categorized in varied clusters. (A) Two-dimensional MDS embedding of DTW distances. The first two MDS dimensions of the two clusters determined by spectral clustering are shown. Each dot indicates a single growth curve. Green and orange denote the two clusters. (B) Averaged trajectories of the two growth curve clusters in A. Lines represent DBA prototypes within the clusters. Transparent shaded bands depict the pointwise min–max envelope of all curves in the corresponding cluster, where each curve was trimmed to the first 90% of time points, and the lower/upper bounds were computed as per-time-point minima and maxima. (C) Two-dimensional MDS embedding of DTW distances. The first two MDS dimensions of the four clusters determined by K-means clustering are shown. Each dot indicates a single growth curve. Color variation denotes the four clusters. (D) Averaged trajectories of the four growth curve clusters in C. Lines and transparent bands are as described in B. (E) Cluster assignment of growth curves across genomes and media. The upper and bottom heatmaps show the distribution of growth curves assigned to two and four clusters, respectively. Five strains (N0, N7, N14, N20, and N28) and 29 media (M1–M29) are indicated. Color variation represents the clusters, corresponding to those indicated in A–D.

Correlations of bacterial growth dynamics to environmental and genomic contributions

To gain deeper biological insights, the growth curves were additionally partitioned into 2 to 10 clusters (Table S6), according to the exhaustive Z-score analysis (Table S5). GBDT models predicted the contributions of chemical components and genome size to the growth dynamics, which were divided into 2 to 10 clusters by each of the four clustering methods (Fig. 6A). As shown above (Fig. 4), glucose, phosphate, and sulfate presented the highest importance when only two clusters were considered (Fig. 6A). Intriguingly, increasing the number of clusters was associated with an increased importance of genome size and decreased importance of glucose and phosphate (Fig. 6A). This suggested that genome size played a more prominent role in finer classifications, along with a reduced impact of chemical components. To test this statistically, pairwise correlation analysis was conducted on the number of clusters and the feature importance of chemical components and genome size (Figs. 6B and S8). The importance of genome size was significantly positively correlated with the number of clusters, whereas those of glucose and phosphate were negatively correlated (Fig. 6B). This indicated opposite contributions of genome and environment to growth dynamics.
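As an illustration of the correlation analysis described above, the following sketch computes Spearman's rank correlation between the number of clusters and a feature importance. It is a plain-Python sketch rather than the code used in the study; the helper names and the importance values are hypothetical and were chosen only to show one increasing and one decreasing trend.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data.

    Both inputs must contain at least two distinct values.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical feature importances for 2 to 10 clusters (illustrative only).
n_clusters = list(range(2, 11))
genome_importance = [0.05, 0.08, 0.10, 0.15, 0.14, 0.20, 0.22, 0.25, 0.30]
glucose_importance = [0.50, 0.45, 0.41, 0.38, 0.36, 0.30, 0.28, 0.27, 0.20]

rho_genome = spearman(n_clusters, genome_importance)
rho_glucose = spearman(n_clusters, glucose_importance)
```

Because Spearman's coefficient depends only on ranks, it captures monotonic trends in importance without assuming that importance changes linearly with the number of clusters.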

Fig. 6

Relationships among genome, environment, and growth dynamics. (A) Feature importance corresponding to changes in the number of clusters. Four blocks show the four clustering methods: spectral clustering, K-means clustering, hierarchical clustering, and GMM. The number of clusters ranges from 2 to 10. Purple gradation indicates the value of feature importance. (B) Correlation analysis. Left and right panels show the Spearman correlation coefficients and the corresponding statistical significance, respectively. The variables are indicated on both axes. Color gradation from cyan to red indicates the correlation coefficients from −1 to 1. Statistical significance is shown in red gradation, on a logarithmic scale. Asterisks indicate the thresholds of statistical significance, * p < 0.05, ** p < 0.01, and *** p < 0.001. (C) Network linking genome and environment to growth dynamics. Nodes correspond to the variables, with colors green, pink, and blue representing environment, genome, and growth dynamics, respectively. Edges indicate the pairwise correlations among the variables, of which the width represents the statistical significance as follows: − log10(p-value) ≤ 1.3, 1.3–2, 2–3, > 3 are shown in thin gray, thin colored, medium, and thick lines, respectively. Red and cyan stand for negative and positive correlations, respectively.

A sign-weighted correlation network was built to clearly visualize the relationships among environment, genome, and growth dynamics (Fig. 6C). Genome size was the only feature showing a positive correlation with growth dynamics. In contrast, chemical components showed negative correlations with growth dynamics. Positive correlations were observed among chemical components like glucose, PO43−, and NH4+, which all had negative correlations with both genome size and growth dynamics. Additionally, GBDT prediction with one-hot encoding analysis was performed, considering genome size as a discrete feature (Table S7). Pairwise correlation analysis of genomes, chemical components, and the number of clusters confirmed the statistically significant positive and negative correlations between genomes and chemical components relative to the number of clusters (Figs. S9 and S10). The resultant network (Fig. 7A) aligned closely with Fig. 6C, suggesting the presence of genomic and environmental layers influencing bacterial growth dynamics. These findings suggest that, within the current dataset, genome size and chemical components hierarchically contribute to bacterial growth dynamics. Genomic variations lead to specific regulations involved in determining growth parameters, while environmental changes induce overall biochemical adjustments, allowing for fine-tuning of growth phases (Fig. 7B). Considering genomic factors as the innate physiological and metabolic capacity of a cell, and environmental factors as external conditions under which this capacity is expressed, bacterial growth can be viewed as the process of maximizing the use of their available genetic resources to survive as efficiently as possible within their specific environmental context. From a fundamental biological perspective, living organisms must operate within the constraints imposed by their external environment. 
The hierarchical contribution of genome and environment is plausible; nevertheless, further investigation using extensive datasets is required to examine this hypothesis.

Fig. 7

Bacterial growth dynamics influenced by genome and environment. (A) A network illustrating the hierarchical relationship between the genome and environment in relation to growth dynamics. Nodes represent variables, with colors green, pink, and blue representing environment, genome, and growth dynamics, respectively. Edges depict pairwise correlations, with width reflecting statistical significance as follows: − log10(p-value) ≤ 1.3, 1.3–2, 2–3, and > 3 are shown as thin gray, thin colored, medium, and thick lines, respectively. Red indicates negative correlations, while cyan indicates positive ones. Pink and green dashed rings represent layers formed by the genome and the environment, respectively. (B) Proposed mechanisms detailing how genomic and environmental factors contribute to bacterial growth. Color variation corresponds to that indicated in A.

Discussion

The present study addressed the biological question of how genomic and environmental factors impact bacterial growth dynamics by data mining. The analyses revealed that the relative importance of genome size and chemical components in predicting growth dynamics changed with the number of clusters, with chemical components being more critical for broader classifications and genome size gaining importance in finer classifications (Fig. 6). This might explain why the dominant factors for growth dynamics differed from those for growth parameters: genome size was the decisive factor for the bacterial growth parameters r, K, and t (Fig. 2), whereas the growth dynamics were primarily determined by the chemical components (Fig. 4). Such a high priority of chemical components in the overall growth dynamics agreed with the finding that environments determine evolutionary trajectories37. Glucose and sulfate were found to be the determinative chemical components (Fig. 4A), supported by the report that these two components were the deciding factors in the fate of bacterial survival as a risk-divergent strategy15. The increased genomic impact on distinguishing finer growth dynamics (Fig. 6) suggested that genetic information shaped the details of bacterial growth. As genes determine the core machinery and regulatory networks, genetic contributions were found to be quantitatively related to growth rates across medium conditions in E. coli14,38,39. The high priority of genome size in deciding the growth parameters was consistent with reports that microbial growth rates could be predicted from genetic features40,41. Since the current study treated genome size as a genetically quantitative variable, it is difficult to provide a specific biological interpretation of genetic function or metabolic processes, as hundreds of genes were deleted. 
Future research on single-gene deletions, especially regulatory units or metabolic pathways, is necessary to develop a clear biological understanding.

Moreover, the observed positive and negative correlations between growth dynamics and genomic or environmental factors (Fig. 7A) indicated that the genome and environment could drive growth dynamics in opposite directions. The interplay between genetic and environmental variations is crucial in living systems42,43, often manifesting as epistasis within fitness landscapes44, and playing roles in evolution45 and adaptation46. Our previous findings of the canceling effect in bacterial growth34 and negative epistasis in transcriptome reorganization35,47 supported the existence of opposing correlations between genome size and chemical components. This suggested not only potential trade-off mechanisms between genome and environment for maintaining homeostasis, but also a fundamental principle of negative epistasis in the genetic and environmental interplay. As hypothesized, the genome played a specific role in regulating bacterial growth parameters, while the environment provides general fine-tuning of growth phases (Fig. 7B). The comprehensive understanding of bacterial growth curves from well-controlled experimental conditions could offer a broader context for microbial population dynamics that influence ecological and evolutionary processes in natural environments48 and the spatiotemporal metabolic modeling of microbial life49.

Multiple validations were employed in the present study to secure the reliability and robustness of the findings. Four widely used clustering methods were adopted, i.e., spectral50,51, K-means52,53, hierarchical53,54, and GMM51,55. Three statistical metrics, i.e., Silhouette, Davies–Bouldin, and Calinski–Harabasz, were employed, as they captured different aspects of cluster compactness and separation56,57. To avoid the bias of any single metric, Z-score normalization was applied as a general strategy to provide a robust framework for evaluating clustering performance58,59. All these considerations in data mining helped to draw a common conclusion on the genomic and environmental impacts on bacterial growth dynamics. Note that an alternative strategy using one-hot encoding, considering genome size as discrete features, was also conducted to verify the common conclusion. The GBDT prediction (Fig. S11) and SHAP evaluation (Fig. S12) of the two clusters also found that the chemical components were the primary determinants of bacterial growth dynamics, similar to that shown in Fig. 4. GBDT prediction with one-hot encoding analysis also verified the relationships of genomes, chemical components, and growth dynamics (Figs. 7A, S9 and S10), consistent with Fig. 6. The agreement between the two alternative approaches and across the four clustering methods supported the generality of such a hierarchical attribution of the genomic and environmental factors to bacterial growth dynamics.

The present data mining successfully achieved biological insights on environmental and genomic contributions to bacterial growth dynamics; nevertheless, the limitations remained in the dataset of bacterial growth dynamics and the time series analysis method. The present dataset included five E. coli strains of varied genome sizes and 29 media combinations with eight chemical components, which might be insufficient to draw a general conclusion of the relationships among bacterial growth dynamics, genome, and environment. The growth curves of other genomic varieties, e.g., single-gene knockouts and consecutive deletions60,61, and more environmental diversity, e.g., more chemical components for rich media15,27, were required to support the findings for drawing a common conclusion. Despite the complexity of bacterial population dynamics, big data from well-controlled high-throughput experiments were powerful in discovering biological laws62,63, such as the bacterial growth profiling in the laboratory mirroring the eco-evolution in nature64. Secondly, the present study used DTW to analyze the bacterial growth curves, as they were typical time series data. Although DTW effectively captured the nonlinear temporal patterns of growth curves, its approximation might smooth out the rapid fluctuations or locally anomalous dynamics65, thereby failing to distinguish insignificant changes on short timescales. Alternative methods for analyzing time series, such as derivative and weighted DTW65,66,67, were required to provide a solid conclusion. Additionally, experimental errors like recording noise may be considered in the future for more extensive use of machine learning and proper interpretation of data-driven approaches in biological experiments. 
In summary, data mining of growth curves provided genetic and environmental clues for understanding the dynamical changes of living systems, as an alternative approach to biological studies that usually address the detailed mechanisms underlying specific findings.

Materials and methods

Bacterial strains and media

The wild-type E. coli W3110 and its derivatives, i.e., four reduced genomes, were used and designated N0 and N7, N14, N20, and N28, respectively14. These strains were all from the National BioResource Project, National Institute of Genetics, Shizuoka, Japan (the KHK collection), and their genetic constructions were previously reported68. The details of the genetic information, such as deleted genomic regions and genes, have been provided either in previous publications14,68 or in the public database for strain distribution (https://shigen.nig.ac.jp/ecoli/strain/download). In brief, in comparison to N0, approximately 260, 731, 908, and 1,017 genes were deleted in N7, N14, N20, and N28, respectively. The growth media for E. coli were based on M63 minimal medium and systematically varied across 29 recipes, as described previously69. Seven pure compounds were used in the media, i.e., glucose, K2HPO4, KH2PO4, MgSO4, thiamine, FeSO4, and (NH4)2SO4 (Wako Fujifilm), which resulted in eight chemical components (Table S1).

Data acquisition

Growth curves were obtained as described in our previous studies14,70. In brief, the E. coli strains were cultured in fresh media in a 96-well microplate (Costar). Six wells at varied locations were used as biological replicates for each condition. The 96-well microplate was incubated in a plate reader (Epoch2, BioTek) with a rotation rate of 567 rpm at 37 °C. The temporal growth of the E. coli cells was detected by measuring the absorbance at 600 nm, and readings were obtained at 30-minute intervals for 48 h. A total of 870 growth curves, i.e., time series data, of experimental reliability were acquired (Table S2).

Calculation of the growth parameters

Bacterial growth parameters were programmatically extracted from the time-series data (i.e., growth curves) using a custom Python script leveraging the pandas and numpy libraries. The raw data, i.e., experimental records, were subjected to baseline correction and preprocessing for each growth curve to remove background noise and ensure suitability for logarithmic transformation. The growth parameters were algorithmically determined according to the previous reports15,29, as follows. The carrying capacity (K) was identified from the peak of the curve after smoothing with a 3-point moving average. The onset of a sustained increase in OD600 records defined the lag time (t). The growth rate (r) was calculated from the exponential phase by identifying its boundaries, computing the log-transformed OD600 gradient, removing statistical outliers, and smoothing with a 3-point moving average. The results were summarized in Table S3.
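The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the custom script used in the study: the onset threshold of 0.01, the requirement of three consecutive points above it, the floor of 1e-6 applied before the logarithm, and the omission of outlier removal are all choices made here for brevity. The function names and the synthetic logistic curve are likewise hypothetical.

```python
import math

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def growth_parameters(times, od, blank=0.0, threshold=0.01, run=3):
    """Return (lag time, growth rate, carrying capacity) for one curve.

    Raises StopIteration if the curve never shows sustained growth.
    """
    # Baseline correction; the small floor keeps the logarithm defined.
    corrected = [max(v - blank, 1e-6) for v in od]
    smooth = moving_average(corrected)

    # Carrying capacity: peak of the smoothed curve.
    K = max(smooth)

    # Lag time: first point from which OD stays above the threshold
    # for `run` consecutive readings (assumed definition of "sustained").
    onset = next(i for i in range(len(smooth) - run + 1)
                 if all(v > threshold for v in smooth[i:i + run]))
    lag = times[onset]

    # Growth rate: maximum smoothed slope of log(OD) after the onset.
    log_od = [math.log(v) for v in smooth]
    slopes = [(log_od[i + 1] - log_od[i]) / (times[i + 1] - times[i])
              for i in range(onset, len(times) - 1)]
    r = max(moving_average(slopes))
    return lag, r, K

# Synthetic logistic curve sampled every 0.5 h for 48 h (illustrative only).
times = [0.5 * i for i in range(97)]
od = [1.0 / (1.0 + 999.0 * math.exp(-0.8 * t)) for t in times]
lag, r, K = growth_parameters(times, od)
```

On this synthetic curve the estimated rate falls slightly below the intrinsic rate of 0.8 per hour, because logistic growth has already begun to slow by the time the optical density exceeds the detection threshold.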

Dynamic time warping

Dynamic time warping (DTW) was used to evaluate the similarity between any pair of growth curves, as previously described33. The following equations were used for the calculations.

$$X=\left({x}_{1},\ldots,{x}_{n}\right),\quad Y=\left({y}_{1},\ldots,{y}_{m}\right)$$
(1)
$$\text{DTW}\left(X,Y\right)={\mathop{\text{min}}_{W}}\sum_{\left(i,j\right)\in W}d\left({x}_{i},{y}_{j}\right)$$
(2)

Here, DTW aligns two time series \(X\) and \(Y\) by searching for a monotonic warping path \(W\) from \(\left(1,1\right)\) to \(\left(n,m\right)\) using unit steps \(\left(1,0\right)\), \(\left(0,1\right)\), and \(\left(1,1\right)\). The local cost is \(d\left({x}_{i},{y}_{j}\right)=\left|{x}_{i}-{y}_{j}\right|\), where \(i\in \{1,\ldots,n\}\) and \(j\in\{1,\ldots,m\}\) are index positions in \(X\) and \(Y\). At each step along the warping path \(W\), the pair \(\left(i,j\right)\) denotes the current aligned samples \({x}_{i}\) and \({y}_{j}\). DTW was computed using the fastdtw (fastdtw.fastdtw) approximate algorithm in Python with the default search radius parameter (radius = 1). The calculated DTW distances of all pairs of 870 growth curves were summarized in Table S4.
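For illustration, Eq. (2) can be evaluated exactly by dynamic programming, as in the sketch below. This is not the fastdtw approximation used in this study; it is the exact quadratic-time recursion that fastdtw approximates, written with numpy only, and the function names are hypothetical.

```python
import numpy as np

def dtw_distance(x, y):
    """Exact DTW distance with local cost |x_i - y_j| and unit steps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # acc[i, j] = minimal accumulated cost of aligning x[:i] with y[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Predecessors correspond to the steps (1,0), (0,1) and (1,1).
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

def pairwise_dtw(curves):
    """Symmetric matrix of DTW distances between all pairs of curves."""
    k = len(curves)
    D = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            D[a, b] = D[b, a] = dtw_distance(curves[a], curves[b])
    return D

# A time-stretched copy aligns at zero cost, unlike a pointwise comparison.
d_stretched = dtw_distance([0, 1, 2], [0, 1, 1, 2])
d_shifted = dtw_distance([0, 0, 0], [1, 1, 1])
```

The exact recursion costs O(nm) per pair, which is why an approximate algorithm is attractive when all pairs of several hundred curves must be compared.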

Clustering analyses

All clustering analyses used the scikit-learn library. Four clustering algorithms, i.e., spectral clustering, K-means clustering, hierarchical agglomerative clustering, and Gaussian mixture model (GMM), were used to analyze the structural relationships of the DTW distances. The DTW distance matrix (Table S4) was first transformed into a Gaussian similarity matrix as follows.

$${A}_{ij}=\exp\left(-\frac{{D}_{ij}^{2}}{2{\sigma}^{2}}\right)$$
(3)

where \({A}_{ij}\) represents the similarity between the i-th and j-th growth curves, \({D}_{ij}\) is their pairwise DTW distance, and σ denotes the kernel bandwidth controlling the smoothness of similarity decay. Here, σ was defined as the standard deviation of all entries in the DTW distance matrix to map the non-Euclidean DTW space into a continuous similarity space. In the following, k denotes the target number of clusters. Spectral clustering used sklearn.cluster.SpectralClustering on the Gaussian affinity matrix \(A\) converted from DTW distances, with n_clusters = k, affinity = ‘precomputed’, and random_state = 42. Hierarchical agglomerative clustering used sklearn.cluster.AgglomerativeClustering directly on the DTW distance matrix with metric = ‘precomputed’, linkage = ‘average’, and n_clusters = k. Because K-means and Gaussian mixture models (GMM) require Euclidean features, DTW distances were first embedded by metric MDS (sklearn.manifold.MDS, n_components = 5, dissimilarity = ‘precomputed’, random_state = 42) to obtain 5-dimensional coordinates. K-means and GMM were then run with sklearn.cluster.KMeans (n_clusters = k, random_state = 42) and sklearn.mixture.GaussianMixture (n_components = k, covariance_type = ‘full’, random_state = 42), respectively.

Statistical evaluation

The clustering results were evaluated using three scikit-learn metrics. The Silhouette score was computed on the DTW distance matrix with metric = ‘precomputed’, whereas the Calinski–Harabasz and Davies–Bouldin indices were computed on a metric MDS embedding of the DTW matrix. To obtain aggregate scores, each metric was converted to Z-scores using sklearn.preprocessing.StandardScaler, and the mean of the Z-scores was used as Zscore_Total. Configurations with invalid labels (fewer than two clusters) or runtime errors were recorded as NaN and excluded from aggregation.
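The aggregation can be sketched as follows. The helper name and toy data are hypothetical, and negating the Davies–Bouldin index so that larger values are better for all three metrics is an assumption of this sketch, not a detail stated above.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

def aggregate_scores(D, X, labelings):
    """Mean Z-score per labeling; invalid labelings are returned as NaN.

    D: precomputed distance matrix; X: Euclidean coordinates (e.g., an MDS embedding).
    """
    rows = []
    for labels in labelings:
        if len(set(labels)) < 2:  # invalid configuration
            rows.append([np.nan, np.nan, np.nan])
            continue
        rows.append([
            silhouette_score(D, labels, metric="precomputed"),
            calinski_harabasz_score(X, labels),
            -davies_bouldin_score(X, labels),  # assumption: negated so higher is better
        ])
    rows = np.array(rows, dtype=float)
    valid = ~np.isnan(rows).any(axis=1)
    total = np.full(len(rows), np.nan)
    if valid.any():
        # Standardize each metric across the valid configurations, then average.
        z = StandardScaler().fit_transform(rows[valid])
        total[valid] = z.mean(axis=1)
    return total

# Toy data: two groups on a line; the first labeling matches the true grouping.
X_demo = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]).reshape(-1, 1)
D_demo = np.abs(X_demo - X_demo.T)
labelings_demo = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # correct split
    [0, 1, 0, 1, 0, 1, 0, 1],  # interleaved split
    [0, 0, 0, 1, 1, 1, 1, 1],  # one misplaced item
    [0, 0, 0, 0, 0, 0, 0, 0],  # invalid: a single cluster
]
zscore_total = aggregate_scores(D_demo, X_demo, labelings_demo)
```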

Gradient boosting decision tree

The gradient boosting decision tree (GBDT) was used to evaluate the contribution of the genome and environment to bacterial growth, as applied in our previous studies15,34,64. The sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier implementations from the scikit-learn library were employed. The input features were the genome size and eight chemical components (thiamine, K+, PO43−, Fe2+, SO42−, NH4+, Mg2+, and glucose). Regression was used to predict the three growth parameters (i.e., t, r, K), and classification was used to predict the discrete cluster labels of growth curves. The main parameter settings for GBDT were as follows: the maximum depth of the weak learner (max_depth) was set to 4 to control the model complexity and prevent overfitting while retaining the necessary nonlinear expressive ability; the learning rate (learning_rate) was set to 0.05 to reduce the impact of a single tree on the overall model and improve generalization performance; the number of base learners (n_estimators) was set to 150 to ensure that the model could fully converge at a low learning rate; and the proportion of data held out for testing (test_size) was set to 0.2 to retain as much data as possible for training. The models were repeatedly trained with five different random seeds (1, 2, 3, 4, and 5) to reduce the influence of any single data partition and obtain robust results. The mean and standard deviation of the five replicates were calculated and used as the results.
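The repeated training scheme for the regression task can be sketched as below. The synthetic features and target are placeholders, not data from this study; using the same seed for both the data partition and the model, and scoring with the coefficient of determination (R²), are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def repeated_gbdt_r2(X, y, seeds=(1, 2, 3, 4, 5)):
    """Train one GBDT regressor per seed; return mean and SD of the test R^2."""
    scores = []
    for seed in seeds:
        # Assumption: the seed controls both the 80/20 split and the model.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = GradientBoostingRegressor(
            max_depth=4, learning_rate=0.05, n_estimators=150, random_state=seed
        )
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))  # R^2 on the held-out 20%
    return float(np.mean(scores)), float(np.std(scores))

# Placeholder data: 9 features standing in for genome size plus eight components.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 9))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] * X_demo[:, 2] + rng.normal(scale=0.01, size=200)
mean_r2, sd_r2 = repeated_gbdt_r2(X_demo, y_demo)
```

The classification task would follow the same loop with GradientBoostingClassifier and the cluster labels as the target.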

One-hot encoding

The genome size was represented using one-hot encoding to reduce the impact of nonlinear features on prediction, as used in biological sequence data71. Directly inputting genome size as a continuous variable might introduce an artificial linear order or proportional relationship into the model, which would not accurately reflect its categorical role in the analysis. To avoid this bias, one-hot encoding was applied. The genome size was discretized and represented as a binary vector with only one position set to 1. That is, the five genomes were represented as (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0), and (0, 0, 0, 0, 1), which were used to replace the genome size in the GBDT prediction.
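The encoding can be illustrated with a few lines of numpy. The assignment of vector positions to strains in the order N0, N7, N14, N20, N28 and the helper names are assumptions of this sketch.

```python
import numpy as np

# Assumed ordering of the five genomes; position i of the vector marks STRAINS[i].
STRAINS = ["N0", "N7", "N14", "N20", "N28"]

def one_hot_genome(strain):
    """Binary vector of length 5 with a single 1 at the strain's position."""
    return np.eye(len(STRAINS), dtype=int)[STRAINS.index(strain)]

def build_features(strain, medium):
    """Concatenate the one-hot genome with the eight chemical concentrations."""
    return np.concatenate([one_hot_genome(strain), np.asarray(medium, dtype=float)])

encoded_n14 = one_hot_genome("N14")
features_demo = build_features("N0", np.zeros(8))  # placeholder concentrations
```

With this representation, the GBDT input has 13 columns (five genome indicators plus eight chemical components) instead of nine.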