Abstract
Silica nanoparticles have been widely adopted as carriers for drug delivery and as components of multifunctional nanocomposites, but they can lead to off-target accumulation and subsequent cytotoxic effects. Previous works explored data-driven methods to improve evaluation efficiency and support the rational design of nanomedicines. However, two challenges remain. The first is data leakage: previous methods either incorporate evaluation-stage features (e.g., Viability_indicator, Positive_control) or rely on one-hot encoding that requires prior knowledge of all categorical values, introducing a data leakage risk. The second is poor generalization: one-hot encoding fixes the dimensionality of the feature space, causing models to fail when faced with unseen class values. In this work, we propose a pre-trained-model-based framework for silica nanoparticle cellular toxicity prediction. To address the data leakage problem, we first removed features that arise from the drug evaluation stage, such as Viability_indicator, Positive_control, SiO2NP_label, Interference_testing, and Assay_viability, and then used the embedding layer of TabPFN to map the original categorical values into dense vectors. To improve generalizability, we employ in-context learning on the pre-trained TabPFN, which has already learned a large number of patterns from a large amount of synthetic data; the model only needs to adapt its output predictive distribution through in-context learning. Experimental results on a publicly available dataset demonstrate that our framework not only achieves state-of-the-art classification performance but also effectively mitigates data leakage and improves generalizability for novel nanoparticle formulations. The code and data are available at https://github.com/AppleMax1992/pre-trained_nanosilica.
Introduction
Silica nanoparticles have become a prominent subject in bioactive agent research1, driven by outstanding merits such as a large specific surface area2,3, tailored pore size4, and easy surface functionalization5,6,7. They not only preserve active drug components but also function as foundational components of nanocomposite materials, such as polymeric drug carriers, responsive hydrogels, and hybrid metal–silica frameworks, thereby enabling controlled release8, enhanced bioavailability1, and improved stability of therapeutic agents9. However, the use of silica nanoparticles for drug delivery still raises critical issues. For example, Mesoporous Silica Nanoparticles (MSNs) lack precise regulation of the loaded drugs under physiological conditions, potentially leading to off-target accumulation and subsequent cytotoxic effects, which pose significant challenges to the clinical application of these drug formulations.
The objective of silica nanoparticle cellular toxicity prediction is to identify the potential cytotoxicity of silica nanoparticles by integrating multi-variable features, including intrinsic nanoparticle properties, silica-specific toxicity indicators, exposure and experimental conditions, and cell-related biological categorical variables. This task improves the efficiency of evaluating silica nanomedicines and provides strong support for their rational design.
Previous works10,11,12 typically rely on empirical methods to investigate the toxicity of silica nanoparticles. These approaches adjust various indicators to set up biological experiments and analyze the toxicity of silica-based drugs. Although these works provide fundamental data and experience for cytotoxicity detection, they require extensive manual experimentation, costly reagents, and long evaluation periods, which limits the efficiency of toxicity detection and rational design for silica nanoparticles. Some studies have proposed data-driven approaches13,14,15 for silica nanoparticle cellular toxicity prediction. These methods collect experimental or literature-derived datasets and incorporate machine learning models with key features covering intrinsic nanoparticle properties, exposure and experimental conditions, cell-related biological context, and response readout to predict the cytotoxicity of silica-based drugs under different conditions, significantly improving the efficiency of silica toxicity prediction. Recently, Martin et al.15 proposed a method to predict the toxicity of silica synthetic materials by modeling 13 key features (such as Concentration, \(\mathrm {SiO_{2}NP\_medium\_serum}\), Cell_morphology, and Cell_organ) with CatBoost.
However, two challenges in the existing work still need to be considered.
-
One challenge comes from data leakage: existing machine learning methods incorporate indicators such as Viability_indicator, Positive_control, SiO2NP_label, Interference_testing, and Assay_viability as inputs. However, these variables are measured at the experimental evaluation stage; when predicting the toxicity of a new silica-based drug candidate, their values are not yet available. In addition, existing methods rely on one-hot encoding16 to handle categorical values. This technique requires prior knowledge of all possible category levels, leading to additional data leakage risk17.
-
Another challenge lies in poor model generalizability. Existing methods rely on one-hot encoding for categorical variables, which fixes the dimensionality of the model. For example, if the Cell_organ feature in historical data contains the three categorical values “skin”, “lung”, and “blood”, the encoded feature expands into three columns. When a new Cell_organ value such as “brain” appears in future samples, the feature space expands to four columns, leading to inconsistent feature dimensions between the training and inference stages and ultimately causing model failure.
In this work, we present a pre-trained-model-based framework for silica nanoparticle cellular toxicity prediction, integrating an advanced machine learning model with nanomaterial safety assessment. Silica nanoparticles have attracted extensive interest in drug delivery, imaging, and multifunctional therapeutic systems, while their potential cytotoxicity remains a major barrier to clinical application. Reliable prediction is therefore crucial not only for biosafety evaluation but also for guiding the rational design of safer and more effective nanocarriers. To address the data leakage problem, we first removed features that come from the drug evaluation stage, such as Viability_indicator, Positive_control, SiO2NP_label, Interference_testing, and Assay_viability, and then adopted the embedding layer of TabPFN to convert categorical data into dense vectors. Compared to sparse one-hot matrices, these dense, continuous vectors make it easier for the model to delineate decision boundaries. To improve generalizability, we adopt in-context learning18 on the pre-trained TabPFN19. TabPFN has been pre-trained on synthetic data and has learned a large number of patterns. When applied to the downstream toxicology prediction task, the model only needs to fit the output predictive distribution through in-context learning, without retraining or adjusting model parameters. Extensive experiments on an open-source dataset of 32 physicochemical and biological descriptors demonstrate that our framework achieves state-of-the-art accuracy, mitigates data leakage, and substantially improves generalizability20. Beyond these methodological advances, this work provides a computational tool to accelerate safe-by-design strategies for silica nanomaterials, thereby advancing their biomedical applications in cancer therapy, diagnostics, and regenerative medicine.
Methods
Overview
Figure 1 illustrates our framework for predicting the cytotoxicity of silica nanoparticles based on the pre-trained TabPFN and in-context learning. The model’s input features include intrinsic nanoparticle properties, exposure and experimental conditions, and cell-related biological variables; the output is the cytotoxicity label for the silica nanoparticles. The dataset is collected from literature and real-world experimental data by Martin et al.15. For a fair comparison, we use data published before 2017 for model training and validation, and data from 2017 onward for testing. We set the latest pre-trained TabPFN as our main backbone network and further compare it with various mainstream machine learning and neural network models. Unlike conventional classifiers that require parameter updates during training, TabPFN performs prediction through in-context learning. During inference, the labeled training samples are concatenated with the test sample and jointly fed into the pre-trained transformer as contextual input. The model then directly outputs the predictive distribution for the test instance conditioned on this context, without any gradient-based retraining. For better model performance, we introduce SHAP values to analyze feature importance and select the features that reach the top 90% of cumulative importance as key features. Finally, we design three sets of experiments to validate the performance of the framework: K-fold cross-validation without feature selection, K-fold cross-validation21 with feature selection, and independent external testing.
Dataset
We adopt the publicly available dataset released by Martin et al.15, which compiles experimental and literature-derived data on silica nanoparticle cellular toxicity published between 2004 and 2022. The dataset contains 5,030 samples collected from 141 independent studies and includes 32 physicochemical, biological, and experimental descriptors covering intrinsic nanoparticle properties, exposure conditions, and cell-related biological context.
Following the original dataset construction by Martin et al., all missing values arising from incomplete or heterogeneous literature reporting were consistently encoded as the categorical label “not_determined”, rather than being numerically imputed. In addition, several physicochemical descriptors such as hydrodynamic size, zeta potential, and polydispersity index (PDI) were represented as categorical ranges when precise numerical values were unavailable. As this missingness is literature-driven rather than outcome-driven, no additional missing-value imputation was applied in our preprocessing pipeline.
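This missing-value convention can be reproduced in a few lines of pandas; note that the column names below are illustrative placeholders, not the dataset's actual schema:

```python
import pandas as pd

# Toy frame mimicking literature-derived gaps (column names are hypothetical).
df = pd.DataFrame({
    "Surface_modification": ["amine", None, "pegylated"],
    "Zeta_potential_mV": ["-20_to_-10", None, "0_to_10"],  # categorical range
})

# Keep missingness as an explicit category instead of imputing numerically.
categorical_cols = ["Surface_modification", "Zeta_potential_mV"]
df[categorical_cols] = df[categorical_cols].fillna("not_determined")
```

Keeping "not_determined" as a regular category lets a categorical-aware model treat the absence of a measurement as information in its own right.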
To ensure a fair comparison with previous studies, we follow the original data splitting strategy of Martin et al., using samples published between 2004 and 2016 for training and validation, and reserving data from 2017 to 2022 as the test set. The temporal cutoff at 2017 was not arbitrarily chosen but reflects the protocol established in the original dataset construction. Beyond comparability, this strict temporal split is designed to reflect realistic deployment scenarios, where models trained on historical literature are applied to newly published nanoparticle formulations. Importantly, this setting intentionally preserves temporal confounding and distribution shift, enabling an evaluation of the model’s robustness to unseen laboratories and evolving experimental practices, and thereby providing an implicit assessment of inter-laboratory measurement bias.
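The temporal split described above can be sketched as follows, assuming a publication-year column (the `year` column name and the records are illustrative):

```python
import pandas as pd

# Toy records tagged with publication year (values are illustrative).
df = pd.DataFrame({"year": [2005, 2010, 2016, 2017, 2020],
                   "cytotoxic": [0, 1, 0, 1, 1]})

train_val = df[df["year"] <= 2016]  # 2004-2016: training + validation
test = df[df["year"] >= 2017]       # 2017-2022: external test set
```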
Data preprocessing
We preprocess the data in two steps, including feature selection and data standardization. For feature selection, the features in the dataset can be summarized into four categories: intrinsic nanoparticle properties, exposure and experimental conditions, cell-related biological context, and response readout. However, response readout features such as Viability_indicator, Positive_control, \(\mathrm {SiO_2}\) NP_label, Interference_testing, and Assay_viability are measured only during experimental evaluation and are therefore unavailable at inference time. These variables were excluded in advance to avoid data leakage. To further reduce noise and improve robustness, we designed an independent experiment and applied SHAP (SHapley Additive exPlanations) analysis to quantify feature contributions. Specifically, the SHAP-based feature importance estimation was conducted exclusively on the training data. The absolute SHAP values were aggregated to obtain a global importance ranking, and the smallest subset of features whose cumulative importance reached 90% was retained as key predictors. Once selected, this feature subset was fixed and consistently applied in all subsequent experiments, including cross-validation, to ensure experimental consistency and prevent information leakage.
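The selection rule can be sketched as follows: absolute SHAP values are aggregated into a global importance ranking, and the smallest prefix of that ranking whose cumulative share reaches 90% is kept. This is a minimal sketch; the helper name is ours, and the toy SHAP matrix stands in for values computed on the training folds:

```python
import numpy as np

def select_top_features(shap_values, feature_names, threshold=0.90):
    """Return the smallest feature subset whose cumulative mean |SHAP|
    importance reaches `threshold` of the total."""
    importance = np.abs(shap_values).mean(axis=0)          # global importance
    order = np.argsort(importance)[::-1]                   # descending rank
    cum = np.cumsum(importance[order]) / importance.sum()  # cumulative share
    k = int(np.searchsorted(cum, threshold) + 1)           # smallest prefix
    return [feature_names[i] for i in order[:k]]
```

Once computed on the training data, the returned subset is frozen and reused in every subsequent experiment, matching the leakage-avoidance protocol above.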
In the data standardization stage, numerical features were normalized using z-score standardization (StandardScaler)22 to mitigate scale discrepancies arising from heterogeneous experimental protocols and measurement conditions across different laboratories. Categorical variables describing experimental settings and biological context were retained explicitly and directly fed into TabPFN, allowing the model to implicitly capture protocol-related and laboratory-related variations through its native categorical embeddings.
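A minimal sketch of this standardization step, applied to the numerical columns only (toy values; categorical columns would bypass the scaler and go to TabPFN unchanged):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical block, e.g. concentration and surface-area columns.
X_num = np.array([[10.0, 50.0],
                  [20.0, 100.0],
                  [30.0, 150.0]])

scaler = StandardScaler().fit(X_num)   # fit on training data only
X_scaled = scaler.transform(X_num)     # zero mean, unit variance per column
```

Fitting the scaler on the training split and only transforming later splits keeps the normalization itself free of leakage.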
Feature importance analysis
Feature importance analysis is essential for understanding the impact of different features on toxicity prediction. Since there is no unified standard for toxicity analysis experiments, using more features first increases the difficulty of data collection; furthermore, additional features introduce noisy data and more missing values.
Preliminary analysis of numerical features and linear separability
After data preprocessing, we first examined the numerical features Concentration_\(\mu\)g/ml, Primary_size_nm, Exposure_time_h, and Surface_area_m2/g as inputs for modeling the binary Cell_viability outcome, in order to determine whether linear models are suitable for our dataset.
As shown in Fig. 2, kernel density plots were generated for each feature, grouped by cell viability label, and the corresponding AUC values were calculated to measure discriminative power. The results indicate that most features exhibit overlapping distributions between the two classes, with AUC values close to 0.5, suggesting weak linear separability. For example, concentration shows particularly poor discriminative ability (AUC = 0.26), while primary size, exposure time, and surface area display only marginal improvement (AUC \(\approx\) 0.53 to 0.54). This preliminary analysis suggests that linear models struggle to capture the relationship between these numerical features and cell viability, and that more complex modeling approaches may be necessary to uncover non-linear patterns.
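The per-feature screening above can be reproduced as follows: each numerical feature is scored alone against the binary label, so an AUC near 0.5 signals weak univariate separability. The data below are synthetic stand-ins, not the paper's measurements:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)        # toy viability labels
X = rng.normal(size=(200, 2))      # stand-ins for concentration, primary size

# Score each feature on its own; uninformative features hover near AUC = 0.5.
aucs = [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]
```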
Global feature importance revealed by SHAP
Next, we employed the advanced TabPFN to model features and applied SHAP analysis to quantify feature importance.
As shown in Fig. 3, the SHAP analysis indicates that Concentration_\(\mu\)g/ml is the most influential predictor, followed by \(\mathrm {SiO_2}\)_NP_medium_serum, Surface_modification, and Hydrodynamic_size_culture_nm, all of which contribute substantially to the model’s predictions. The dominant role of nanoparticle concentration is consistent with the well-established dose–response relationship in nanotoxicology, where increased exposure levels are closely associated with enhanced reactive oxygen species (ROS) generation and oxidative stress. Serum-related features (e.g., \(\mathrm {SiO_2}\)_NP_medium_serum) also rank highly, reflecting the importance of protein corona formation in biological media. The adsorption of serum proteins onto nanoparticle surfaces can significantly alter their effective size, surface properties, and cellular uptake behavior, thereby modulating cytotoxic responses. Similarly, Surface_modification and Hydrodynamic_size_culture_nm are closely related to nanoparticle–cell membrane interactions, which have been widely reported as key factors influencing membrane disruption and intracellular transport. In addition to physicochemical descriptors, several biological and experimental variables, such as Cell_culture, Cell_id, and Primary_size_nm, also exhibit notable influence, highlighting the role of cellular context and experimental conditions in cytotoxicity outcomes. In contrast, features such as Surface_area_m2/g and \(\mathrm {SiO_2}\)_NP_synthesis show comparatively lower contributions in our setting.
Feature interaction effects and mechanistic alignment
To further examine whether TabPFN captures interaction effects beyond individual feature contributions, we analyzed SHAP interaction plots for key physicochemical and experimental variables, as shown in Fig. 4. These plots provide insight into how pairs of features jointly influence the model’s cytotoxicity predictions.
As illustrated in Fig. 4a, the interaction between Concentration_\(\mu\)g/ml and \(\mathrm {SiO_2}\)_NP_source reveals a clear concentration-dependent pattern. At very low concentrations, SHAP values are predominantly negative, indicating a limited contribution to cytotoxicity regardless of nanoparticle source. As concentration increases, SHAP values shift toward positive values, with noticeable dispersion across different sources, suggesting that source-related differences become more influential under higher exposure levels. This indicates that TabPFN captures a nonlinear dose–context interaction rather than relying on concentration alone. Figure 4b demonstrates the interaction between \(\mathrm {SiO_2}\)_NP_medium_serum and Surface_modification. Distinct surface modification categories exhibit systematically different SHAP value distributions under the same serum condition, with some modifications consistently associated with negative contributions and others shifting toward positive contributions. This pattern suggests that the effect of surface chemistry on cytotoxicity is strongly modulated by the biological medium, reflecting the combined influence of surface properties and protein corona formation. The interaction between Hydrodynamic_size_culture_nm and Surface_modification in Fig. 4c further highlights the context-dependent role of particle size. While hydrodynamic size alone does not exhibit a monotonic contribution, its interaction with surface modification leads to structured variations in SHAP values, indicating that size-related effects are conditional on surface chemistry rather than acting as an isolated predictor. Finally, Fig. 4d shows the interaction between Surface_modification and \(\mathrm {SiO_2}\)_NP_medium_serum, where specific combinations consistently produce either positive or negative SHAP contributions. 
This result underscores that the cytotoxic impact of surface modification cannot be interpreted independently of the surrounding biological environment, and that TabPFN effectively captures such coupled effects.
Overall, these SHAP interaction patterns demonstrate that TabPFN learns meaningful, non-additive relationships among exposure level, nanoparticle physicochemical properties, and experimental context. Importantly, the learned interactions are consistent with established structure–activity relationships in nanotoxicology, where cytotoxicity emerges from the joint influence of dose, particle characteristics, and biological environment rather than from any single factor in isolation.
Evaluation metrics
To comprehensively evaluate the performance of the model, we adopt five standard classification metrics23: Accuracy, Precision, Recall, F1-score, and AUC-ROC.
We now introduce each metric, using TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) to indicate whether samples are correctly or incorrectly classified as cytotoxic or non-cytotoxic: TP denotes cytotoxic samples correctly identified as toxic, TN non-cytotoxic samples correctly identified as non-cytotoxic, FP non-cytotoxic samples incorrectly classified as toxic, and FN cytotoxic samples incorrectly classified as non-cytotoxic.
Accuracy measures the overall proportion of correctly classified samples. It is suitable when the dataset is balanced but can be misleading under class imbalance. It is defined as \(\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}\).
Precision refers to the proportion of truly positive results among all positive predictions, \(\mathrm{Precision}=\frac{TP}{TP+FP}\). A high Precision means fewer false positives, which is especially important when false positives are costly (e.g., in medical diagnosis).
Recall reflects the ability to capture all actual positives, \(\mathrm{Recall}=\frac{TP}{TP+FN}\). High Recall is critical when missing a positive case is costly (e.g., disease prediction).
The F1-score combines Precision and Recall as their harmonic mean, \(F_1=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\), making it especially useful when dealing with imbalanced classes.
The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is used to evaluate the overall performance of a binary classification model. The ROC curve is plotted with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis. The FPR is defined as \(\frac{FP}{FP+TN}\), representing the proportion of negative samples incorrectly predicted as positive. The TPR is defined as \(\frac{TP}{TP+FN}\), representing the proportion of positive samples correctly identified. The AUC is the area under the ROC curve, ranging between 0 and 1, and can be computed as \(\mathrm{AUC}=\int _{0}^{1} TPR \, d(FPR)\).
A larger value indicates a stronger ability of the model to distinguish between positive and negative samples.
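The five metrics above can be computed with scikit-learn on a toy prediction set (1 = cytotoxic, 0 = non-cytotoxic); the labels and scores below are illustrative only:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                  # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

acc  = accuracy_score(y_true, y_pred)    # (TP+TN) / total
prec = precision_score(y_true, y_pred)   # TP / (TP+FP)
rec  = recall_score(y_true, y_pred)      # TP / (TP+FN)
f1   = f1_score(y_true, y_pred)          # harmonic mean of prec and rec
auc  = roc_auc_score(y_true, y_score)    # uses scores, not hard labels
```

Note that AUC-ROC is the only one of the five that consumes continuous scores rather than thresholded labels.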
Results and discussion
To evaluate the performance of the model on the silica nanoparticle cellular toxicity prediction task, traditional machine learning and neural network models are used as baselines, including GradientBoost24, RandomForest25, LogisticRegression26, XGBoost27, SVC28, BPNN29, CatBoost30, FTTransformer31, and TabTransformer32. For all experiments involving TabPFN, we use all training samples as the context, 12 transformer layers, an embedding dimensionality of 512, and 8 attention heads. The model is pre-trained and applied in an in-context learning manner. This design choice ensures reproducibility and avoids introducing additional tuning bias in few-shot settings. To avoid the risk of data leakage, we divide the data into internal and external sets. The internal set includes training and validation data from 2004 to 2016 for the initial evaluation of the model, and the external set includes test data from 2017 to 2022 for the secondary evaluation and for verifying the model's generalization ability on unknown distributions.
We conducted three experiments: internal validation without feature selection, internal validation with feature selection, and external validation. We introduce them in detail in the following sections.
Internal validation without feature selection
To verify the effectiveness of the proposed framework, we first compare the performance of different models on internal data. The data is evaluated under a 5-fold cross-validation setting, where the dataset is randomly partitioned into five folds, one used for validation and the remaining four for training in each round.
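The protocol can be sketched with scikit-learn's KFold; the classifier and data below are stand-ins (a baseline on synthetic data, not the TabPFN backbone), intended only to show the fold structure:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # toy feature matrix
y = rng.integers(0, 2, 100)            # toy cytotoxicity labels

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    # Four folds train the model, the held-out fold validates it.
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

mean_acc, std_acc = np.mean(scores), np.std(scores)  # fold mean ± deviation
```

The per-fold mean and standard deviation correspond to the "metric ± deviation" entries reported in Table 1.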
From Table 1, we observe that under 5-fold internal validation without feature selection, TabPFN consistently achieves the strongest and most stable performance across all evaluation metrics, outperforming both traditional machine learning methods and recent transformer-based tabular models. This observation is well aligned with our design choice of avoiding leakage-prone feature engineering, directly handling categorical variables, and leveraging a pre-trained probabilistic prior via in-context inference. Compared with the strongest conventional baseline, GradientBoost, TabPFN demonstrates clear and consistent improvements across all metrics, including Accuracy (0.8885 vs. 0.8577), Precision (0.8833 vs. 0.8552), Recall (0.8703 vs. 0.8287), F1-score (0.8760 vs. 0.8388), and AUC-ROC (0.9563 vs. 0.9315). These gains indicate that TabPFN is able to capture more informative decision boundaries without relying on explicit feature selection. Moreover, TabPFN maintains relatively low variance across folds (e.g., Accuracy ±0.0047), suggesting stable generalization behavior under different data splits. TabPFN further widens the performance gap over RandomForest, achieving substantial improvements in Recall (+13.1 percentage points), F1-score (+11.7 percentage points), and AUC-ROC (+3.88 percentage points). This highlights its superior capability in modeling complex nonlinear interactions and cross-feature dependencies that are difficult to capture using ensemble-based tree methods. Similar advantages are observed when compared with XGBoost, SVC, and Logistic Regression, all of which exhibit noticeably lower Recall and F1-scores, indicating weaker sensitivity to cytotoxic samples. The high AUC-ROC score of 0.9563, together with a well-balanced Precision–Recall trade-off (0.8833 and 0.8703), suggests strong class separability and effective discrimination of cytotoxic cases. 
This is particularly important in cytotoxicity prediction, where failing to identify hazardous samples can have severe downstream consequences. In contrast, neural-network-based baselines such as BPNN show unstable behavior with large variance across folds, while even CatBoost, despite its competitive performance, remains inferior to TabPFN across all evaluation metrics and appears more sensitive to data partitioning.
Overall, these results confirm that TabPFN provides a robust, high-performing, and low-variance solution for cytotoxicity prediction in the absence of feature selection. Its consistent superiority across all metrics demonstrates strong generalization ability and practical reliability, making it especially suitable for realistic toxicological screening scenarios where feature engineering is limited or potentially unreliable.
Internal validation with feature selection
To further examine the impact of feature selection on model performance, we conduct experiments under the same 5-fold internal validation protocol after retaining the selected key features.
As shown in Table 2, TabPFN continues to deliver the best overall performance under the feature selection setting, achieving consistent improvements across all evaluation metrics while maintaining low variance across folds. Specifically, TabPFN attains an Accuracy of 0.8967, Precision of 0.8899, Recall of 0.8827, F1-score of 0.8861, and AUC-ROC of 0.9639, with small standard deviations (e.g., Accuracy ±0.0055 and AUC-ROC ±0.0018). Compared with the no feature selection setting, these results indicate enhanced stability and stronger class separability when informative features are retained. Relative to the strongest baseline, GradientBoost, TabPFN demonstrates clear and consistent gains, improving Accuracy by 4.17 percentage points (0.8967 vs. 0.8550), Precision by 3.69 points, Recall by 5.79 points, F1-score by 5.08 points, and AUC-ROC by 3.54 points. The improvements are even more pronounced when compared with RandomForest, with increases of +8.15 points in Recall, +11.63 points in F1-score, and +3.87 points in AUC-ROC. These results highlight TabPFN’s superior ability to exploit informative feature subsets while preserving robust generalization. Most traditional baselines, including Logistic Regression, SVC, and XGBoost, exhibit modest and relatively uniform improvements after feature selection, suggesting that pruning redundant or noisy features benefits linear and shallow nonlinear models to a limited extent. In contrast, GradientBoost shows only marginal changes, while BPNN continues to display substantial variance across metrics, reflecting persistent training instability and sensitivity to data splits. Although CatBoost also benefits from feature selection, its overall performance and stability remain inferior to TabPFN, indicating that even strong tree-based ensembles cannot fully match the robustness of a pre-trained in-context learner. 
From a safety-critical perspective, the increase in Recall to 0.8827 without sacrificing Precision (0.8899) is particularly important for cytotoxicity prediction, as it reduces the likelihood of missed hazardous samples while avoiding excessive false positives. The corresponding gains in F1-score and AUC-ROC further confirm improved class discrimination and higher-quality decision boundaries.
To assess the statistical reliability of the observed performance gains under the feature selection setting, we further conduct a one-sided Wilcoxon signed-rank test on the F1-scores across the same 5-fold splits, with the alternative hypothesis that TabPFN outperforms the corresponding baseline. The results are summarized in Table 3. As shown in the table, TabPFN achieves statistically significant improvements over all compared baselines, including GradientBoosting, RandomForest, Logistic Regression, XGBoost, SVC, BPNN, FTTransformer, TabTransformer, and CatBoost, with a consistent test statistic \(W = 15.0\) and \(p = 0.03125\) in all pairwise comparisons. This corresponds to the smallest attainable p-value for a one-sided Wilcoxon test with five paired observations, indicating that TabPFN outperforms each baseline on all five folds without exception. Importantly, these statistically significant results are not driven by isolated folds or outliers, but rather by consistent and uniform improvements across all cross-validation splits. In particular, the significant gains over strong tree-based baselines such as GradientBoosting and CatBoost confirm that the observed advantages cannot be attributed solely to model capacity or hyperparameter tuning, but instead stem from the pre-trained probabilistic prior and in-context inference mechanism employed by TabPFN.
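The significance check described above can be reproduced with SciPy; the per-fold F1-scores below are illustrative stand-ins, not the paper's values. With five folds that all favor one model, the one-sided exact test yields its smallest attainable p-value:

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold F1-scores over the same 5 splits (5 paired values).
f1_tabpfn   = [0.890, 0.885, 0.900, 0.872, 0.893]
f1_baseline = [0.840, 0.855, 0.830, 0.852, 0.803]

# One-sided test: alternative hypothesis is that TabPFN's F1 is higher.
stat, p = wilcoxon(f1_tabpfn, f1_baseline, alternative="greater")
# All five differences are positive, so W = 1+2+3+4+5 = 15 and p = 1/32
```

Because the exact null distribution over 5 paired observations has \(2^5 = 32\) equally likely sign patterns, a uniform win on every fold gives exactly \(p = 1/32 = 0.03125\), matching the value reported in Table 3.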
Overall, feature selection further amplifies the advantages of TabPFN, yielding higher accuracy, better calibrated Precision–Recall trade-offs, and consistently superior stability. Combined with the strong statistical significance observed across all baselines, these results provide robust evidence that TabPFN is a reliable and reproducible solution for cytotoxicity prediction in realistic, safety-critical settings.
External validation with feature selection
To evaluate the generalization ability of the proposed model in real-world scenarios, we performed external validation using a novel test set covering data from 2017 to 2022. This dataset is completely disjoint from the training data and simulates real-world deployment environments with varying time and distributions.
We first evaluated the model performance using standard evaluation metrics, including Accuracy, Precision, Recall, F1-score, and AUC-ROC. As shown in Table 4, TabPFN performed best across all metrics, achieving an Accuracy of 0.8564, Precision of 0.8169, Recall of 0.7825, an F1-score of 0.7971, and an AUC-ROC of 0.8938. These results demonstrate that TabPFN maintains strong predictive power even when transferred to previously unseen data distributions. TabPFN exhibits clear performance advantages over other strong baseline models. Compared to CatBoost, which ranked second overall, TabPFN improved Accuracy by 6.30 percentage points (0.8564 vs. 0.7934), Precision by 8.83 percentage points (0.8169 vs. 0.7286), Recall by 2.85 percentage points (0.7825 vs. 0.7540), F1-score by 5.84 percentage points (0.7971 vs. 0.7387), and AUC-ROC by 2.55 percentage points (0.8938 vs. 0.8683). Similarly, compared to XGBoost, TabPFN consistently outperformed across all metrics, with improvements of 5.31 percentage points in Accuracy (0.8564 vs. 0.8033), 7.73 percentage points in Precision (0.8169 vs. 0.7396), 2.04 percentage points in Recall (0.7825 vs. 0.7621), 4.81 percentage points in F1-score (0.7971 vs. 0.7490), and 4.09 percentage points in AUC-ROC (0.8938 vs. 0.8529).
It is worth noting that several baseline models exhibited an unbalanced Precision–Recall trade-off under distribution shift. For instance, Random Forest achieved relatively high Precision (0.7556) but substantially lower Recall (0.5912), indicating a conservative decision boundary that misses a considerable number of cytotoxic samples. In contrast, XGBoost showed stronger Recall (0.7621) at the expense of Precision (0.7396), suggesting an increased false-positive rate. CatBoost demonstrated a more balanced behavior across metrics, highlighting the effectiveness of gradient boosting methods for tabular classification, yet its overall performance remained consistently below that of TabPFN. Gradient Boosting and BPNN achieved moderate performance, while Logistic Regression and SVC showed marked degradation across most metrics, reflecting the limited capacity of linear or shallow models to capture complex nonlinear relationships under temporal and domain shifts. Transformer-based tabular models, including FTTransformer and TabTransformer, also underperformed TabPFN, suggesting that attention-based architectures trained from scratch are less robust than models leveraging strong probabilistic priors and in-context inference. Although TabPFN’s absolute performance decreased relative to internal validation, as expected under real-world domain shift, it preserved the best balance between Precision and Recall and achieved the highest F1-score and AUC-ROC among all evaluated methods. This robustness is particularly desirable for cytotoxicity prediction, where minimizing missed high-risk cases while avoiding excessive false positives is of critical importance.
Furthermore, we visualize the confusion matrices of different models to provide a more fine-grained comparison of their classification behavior on cytotoxic and non-cytotoxic samples. As shown in Fig. 5, clear performance differences emerge in how models trade off false positives and false negatives under external validation.
TabPFN exhibits the most balanced prediction behavior, correctly classifying 92.9% of non-cytotoxic samples while still recalling 63.6% of cytotoxic cases. This balance is consistent with its superior performance across Accuracy, Precision, Recall, F1-score, and AUC-ROC, and indicates that TabPFN effectively mitigates the common bias toward the majority class under distribution shift. By maintaining high specificity without excessively sacrificing sensitivity, TabPFN achieves a favorable compromise between reliability and risk control. Among the strong baselines, CatBoost and XGBoost follow behind TabPFN but exhibit different decision preferences. CatBoost achieves 83.2% Accuracy on non-cytotoxic samples and recalls 67.6% of cytotoxic cases, reflecting a relatively balanced yet still conservative decision boundary. In contrast, XGBoost sacrifices a small amount of non-cytotoxic Accuracy (84.4%) to achieve a comparable Recall of 68.0% on cytotoxic samples, indicating a slightly more aggressive detection strategy that increases false positives. Random Forest demonstrates a highly conservative classification behavior, correctly identifying 97.4% of non-cytotoxic samples but recalling only 20.9% of cytotoxic cases, resulting in a large number of false negatives. BPNN performs moderately, achieving 73.8% Accuracy on non-cytotoxic samples and 68.0% Recall on cytotoxic cases, but still shows reduced stability under temporal and distributional shifts. In contrast, Gradient Boost, SVC, and Logistic Regression strongly favor the majority class. Gradient Boost correctly classifies 83.1% of non-cytotoxic samples but recalls only 64.9% of cytotoxic cases. SVC exhibits an even stronger bias, achieving 95.9% Accuracy on non-cytotoxic samples while recalling merely 17.3% of cytotoxic cases. Logistic Regression follows a similar pattern, with 75.6% non-cytotoxic Accuracy and only 41.8% Recall for cytotoxic samples. 
These results highlight the tendency of linear or shallow models to collapse toward majority-class predictions when deployed under out-of-distribution conditions. Overall, models with stronger nonlinear modeling capacity, such as CatBoost and XGBoost, demonstrate improved sensitivity to cytotoxic samples. However, TabPFN stands out by achieving the most favorable balance between minimizing false negatives and avoiding excessive false positives, reinforcing its robustness under external distribution shift and its suitability for reliable cytotoxicity screening.
Finally, we visualize the ROC curves to further examine how effectively each classifier separates cytotoxic and non-cytotoxic samples across the full range of decision thresholds. As shown in Fig. 6, the ROC curves provide a threshold-independent view of model discrimination ability, revealing both the overall ranking of methods and their trade-offs between true positive and false positive rates under external validation.
Among all compared models, TabPFN exhibits the steepest ROC curve and the largest area under the curve, achieving an AUC of 0.8938. This indicates the strongest and most consistent class separability under distribution shift, confirming that TabPFN maintains high sensitivity without incurring excessive false positives across different operating points. The pronounced curvature toward the top-left corner further suggests that TabPFN can sustain favorable detection performance even under strict decision thresholds. CatBoost (AUC = 0.8683) and XGBoost (AUC = 0.8529) follow closely, demonstrating strong but comparatively weaker discrimination ability. Their ROC curves remain well above the diagonal reference, yet exhibit smaller margins than TabPFN, indicating reduced robustness in balancing sensitivity and specificity as decision thresholds vary. Gradient Boost (AUC = 0.8467) and Random Forest (AUC = 0.8175) form a second tier, offering reasonable class separability but with flatter curves that reflect more limited discrimination under threshold changes. The neural-network-based BPNN achieves an AUC of 0.7865, providing moderate discrimination but showing a slower increase in true positive rate at low false positive rates. In contrast, Logistic Regression (AUC = 0.6552) and SVC (AUC = 0.6798) perform substantially worse, with ROC curves that lie close to the diagonal, indicating weak discriminative power under external distribution shift. Transformer-based tabular models, including FTTransformer (AUC = 0.6729) and TabTransformer (AUC = 0.7210), also lag behind, suggesting that attention-based architectures trained from scratch are less effective than pre-trained probabilistic priors in this setting.
Overall, the ROC analysis provides a comprehensive and threshold-independent confirmation of the trends observed in scalar metrics and confusion matrices. TabPFN consistently dominates across the entire operating range, achieving superior discrimination at all thresholds and reinforcing its robustness and generalization capability under realistic deployment conditions.
Ablation study
Unlike conventional ablation studies that remove or replace individual architectural components of a proposed model, the goal of this section is to analyze the effectiveness of the key learning mechanisms underlying TabPFN in the context of our task. Since TabPFN is employed as an off-the-shelf pre-trained foundation model and its internal architecture is not explicitly modified, we conduct a mechanism-level ablation rather than a module-level ablation.
Specifically, we focus on two core design hypotheses of TabPFN: (i) its ability to learn informative dense representations for tabular data, and (ii) its reliance on in-context learning to adapt to downstream tasks at inference time. Accordingly, we evaluate (1) the geometric structure induced by TabPFN embeddings compared with discretization-based one-hot encodings, and (2) the sensitivity of predictive performance to the number of labeled context examples.
Effectiveness of dense embeddings
To demonstrate the effectiveness of dense embeddings learned by TabPFN, we visualize the clustering structure in the latent space induced by TabPFN embeddings and dummy one-hot representations on the training set. For the latter, continuous features are first discretized using quantile-based binning and then encoded via one-hot representations, serving as a commonly used non-parametric baseline for tabular data.
As shown in Fig. 7, TabPFN embeddings form two compact and well-separated manifolds, yielding a high Silhouette score of 0.5187. This indicates strong intrinsic cluster cohesion and inter-cluster separation in a purely unsupervised setting, despite the fact that class labels are not used during clustering. In contrast, dummy one-hot representations exhibit fragmented and heavily overlapping clusters, resulting in a near-zero Silhouette score of 0.0790, which suggests the absence of meaningful geometric structure in the induced feature space. These results demonstrate that TabPFN learns intrinsically well-structured dense representations that preserve semantic relationships between samples, whereas simple discretization-based encodings fail to induce coherent cluster geometry.
Effectiveness of in-context learning
To investigate the role of in-context learning in TabPFN, we fix the test set and vary the number of labeled context (support) examples provided at inference time, with \(n_{ctx}\in \{5,10,15,20,50\}\). For each setting, we repeat stratified sampling of the context set 20 times and report the mean and standard deviation of the evaluation metrics in Table 5. The results reveal a clear and consistent performance improvement as the number of context examples increases. Macro-F1 improves from 0.4386 at \(n_{ctx}=5\) to 0.5770 at \(n_{ctx}=50\), while Accuracy increases from 0.6349 to 0.6873. This trend indicates that TabPFN does not rely solely on its pre-trained prior, but can effectively exploit additional context examples to adapt its decision function to the target data distribution.
Notably, AUC-ROC exhibits the most substantial gain, increasing from 0.5706 to 0.7462 as \(n_{ctx}\) grows, accompanied by a pronounced reduction in variance (standard deviation decreasing from 0.0805 to 0.0379). Since AUC-ROC is threshold-independent, this improvement suggests that larger context sets enable more stable and discriminative ranking of class probabilities rather than merely shifting a classification threshold. Furthermore, both macro-Precision and macro-Recall consistently increase with larger \(n_{ctx}\) (from 0.5624and0.5073 to 0.6840and0.6023), indicating that the observed gains are not achieved by trading off one type of error for another.
Conclusion
In this work, we developed a pre-trained model based framework for the cellular toxicity prediction of silica nanoparticles. By systematically addressing key limitations in existing approaches, including data leakage from the evaluation stage features, issues with one-hot encoding of categorical variables, and poor generalization of traditional machine-learning models, we demonstrated a robust and efficient strategy for predicting nanoparticle cytotoxicity. Our framework leverages SHAP analysis to identify critical features and employs TabPFN with dense embeddings and in-context learning to enhance model generalization without retraining. Evaluation on literature-derived cytotoxicity datasets shows that our method achieves state-of-the-art predictive performance, effectively avoids data leakage, and maintains strong generalizability to novel nanoparticle formulations. This study highlights the potential of pre-trained, data-driven frameworks in accelerating the evaluation and rational design of silica-based nanomedicines.
Future work may extend this approach to other classes of nanomaterials and incorporate multi-omics data to further improve predictive Accuracy and mechanistic interpretability.
Data availability
The implementation of the proposal, dataset and the experimental results of the evaluation are open public at https: //github.com/AppleMax1992/pretrained_nanosilica.
References
Zhang, J. et al. Custom-design of multi-stimuli-responsive degradable silica nanoparticles for advanced cancer-specific chemotherapy. Small 20, 2400353 (2024).
Qiao, L. et al. A peptide-based subunit candidate vaccine against sars-cov-2 delivered by biodegradable mesoporous silica nanoparticles induced high humoral and cellular immunity in mice. Biomaterials Science 9, 7287–7296 (2021).
Wang, Y., Zhang, B., Ding, X. & Du, X. Dendritic mesoporous organosilica nanoparticles (dmons): Chemical composition, structural architecture, and promising applications. Nano Today 39, 101231 (2021).
Liao, Y. et al. Stimuli-responsive mesoporous silica nanoplatforms for smart antibacterial therapies: From single to combination strategies. Journal of Controlled Release 378, 60–91 (2025).
Li, X. et al. Ultrasound-activated precise sono-immunotherapy for breast cancer with reduced pulmonary fibrosis. Advanced Science 12, 2407609 (2025).
He, L., Javid Anbardan, Z., Habibovic, P. & van Rijt, S. Doxorubicin-and selenium-incorporated mesoporous silica nanoparticles as a combination therapy for osteosarcoma. ACS Applied Nano Materials 7, 25400–25411 (2024).
Lei, Q. et al. Sol-gel-based advanced porous silica materials for biomedical applications. Advanced Functional Materials 30, 1909539 (2020).
Khalbas, A. H., Albayati, T. M., Ali, N. S. & Salih, I. K. Drug loading methods and kinetic release models using of mesoporous silica nanoparticles as a drug delivery system: A review. South African Journal of Chemical Engineering (2024).
Saha, A., Mishra, P., Biswas, G. & Bhakta, S. Greening the pathways: a comprehensive review of sustainable synthesis strategies for silica nanoparticles and their diverse applications. RSC advances 14, 11197–11216 (2024).
Kim, I.-Y., Joachim, E., Choi, H. & Kim, K. Toxicity of silica nanoparticles depends on size, dose, and cell type. Nanomedicine: Nanotechnology, Biology and Medicine 11, 1407–1416 (2015).
Chen, L. et al. The toxicity of silica nanoparticles to the immune system. Nanomedicine 13, 1939–1962 (2018).
Croissant, J. G., Butler, K. S., Zink, J. I. & Brinker, C. J. Synthetic amorphous silica nanoparticles: toxicity, biomedical and environmental implications. Nature Reviews Materials 5, 886–909 (2020).
Concu, R., Kleandrova, V. V., Speck-Planche, A. & Cordeiro, M. N. D. Probing the toxicity of nanoparticles: a unified in silico machine learning model based on perturbation theory. Nanotoxicology 11, 891–906 (2017).
Ahmadi, M., Ayyoubzadeh, S. M. & Ghorbani-Bidkorpeh, F. Toxicity prediction of nanoparticles using machine learning approaches. Toxicology 501, 153697 (2024).
Martin, n et al. Evidence-based prediction of cellular toxicity for amorphous silica nanoparticles. ACS nano 17, 9987–9999 (2023).
Karthiga, R., Usha, G., Raju, N. & Narasimhan, K. Transfer learning based breast cancer classification using one-hot encoding technique. In 2021 international conference on artificial intelligence and smart systems (ICAIS), 115–120 (IEEE, 2021).
Apicella, A., Isgrò, F. & Prevete, R. Don’t push the button! exploring data leakage risks in machine learning and transfer learning. Artificial Intelligence Review 58, 1–58 (2025).
Liu, R. et al. In-context learning for zero-shot medical report generation. In Proceedings of the 32nd ACM international conference on multimedia, 8721–8730 (2024).
Hollmann, N. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025).
Gao, S., Zhou, H., Gao, Y. & Zhuang, X. Bayeseg: Bayesian modeling for medical image segmentation with interpretable generalizability. Medical image analysis 89, 102889 (2023).
Mahesh, T. et al. Adaboost ensemble methods using k-fold cross validation for survivability with the early detection of heart disease. Computational Intelligence and Neuroscience 2022, 9005278 (2022).
Thara, D. et al. Auto-detection of epileptic seizure events using deep neural network with different feature scaling techniques. Pattern Recognition Letters 128, 544–550 (2019).
Hossin, M. & Sulaiman, M. N. A review on evaluation metrics for data classification evaluations. International journal of data mining & knowledge management process 5, 1 (2015).
Aymaz, S. Boosting medical diagnostics with a novel gradient-based sample selection method. Computers in Biology and Medicine 182, 109165 (2024).
Vlachas, C. et al. Random forest classification algorithm for medical industry data. In SHS Web of Conferences, vol. 139, 03008 (EDP Sciences, 2022).
Schober, P. & Vetter, T. R. Logistic regression in medical research. Anesthesia & Analgesia 132, 365–366 (2021).
Zheng, J. et al. Metabolic syndrome prediction model using bayesian optimization and xgboost based on traditional chinese medicine features. Heliyon 9 (2023).
Khushi, M. et al. A comparative performance analysis of data resampling methods on imbalance medical data. Ieee Access 9, 109960–109975 (2021).
Ben, S. J., Dörner, M., Günther, M. P., von Känel, R. & Euler, S. Proof of concept: Predicting distress in cancer patients using back propagation neural network (bpnn). Heliyon 9 (2023).
Safaei, N. et al. E-catboost: An efficient machine learning framework for predicting icu mortality using the eicu collaborative research database. Plos one 17, e0262895 (2022).
Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Advances in neural information processing systems 34, 18932–18943 (2021).
Huang, X., Khetan, A., Cvitkovic, M. & Karnin, Z. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv:2012.06678 (2020).
Funding
This work was supported by the National Natural Science Foundation of China International Cooperation Major Project: Small scale and Active motion of nonlinear Reactive Wave driven soft robot, approval number: 22120102001.
Author information
Authors and Affiliations
Contributions
Huixia Zhang: Methodology, Writing, Original draft, Editing. Jiajun Tong: Software, Writing - review & editing. Minmin Chen: Writing - review & editing. Xichuan Cao: Supervision, Writing - review & editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, H., Tong, J., Chen, M. et al. Boosting pre-trained model with silica nanoparticles cellular toxicity prediction. Sci Rep 16, 3848 (2026). https://doi.org/10.1038/s41598-025-33872-0
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-33872-0









