A large-scale human toxicogenomics resource for drug-induced liver injury prediction

Bergen, Volker; Kodella, Konstantia; Srikrishnan, Sreenath; Barrandon, Ornella; Anderson, Sara; Rogers-Grazado, Max; Fowler, Casey; Beyene, Hirit; Robichaud, Nicole; Fulton, Timothy; Lapchyk, Nina; Cortes, Mauricio; Plugis, Nick; Goddeeris, Matthew; Zamanighomi, Mahdi

doi:10.1038/s41467-025-65690-3


Article
Open access
Published: 13 November 2025


Volker Bergen ORCID: orcid.org/0000-0003-2486-1473¹,
Konstantia Kodella¹,
Sreenath Srikrishnan ORCID: orcid.org/0009-0009-3930-0725¹,
Ornella Barrandon¹,
Sara Anderson¹,
Max Rogers-Grazado¹,
Casey Fowler¹,
Hirit Beyene¹,
Nicole Robichaud¹,
Timothy Fulton¹,
Nina Lapchyk¹,
Mauricio Cortes¹,
Nick Plugis¹,
Matthew Goddeeris¹ &
Mahdi Zamanighomi ORCID: orcid.org/0009-0009-2503-5545¹

Nature Communications volume 16, Article number: 9860 (2025) Cite this article

Abstract

Drug-Induced Liver Injury (DILI) remains one of the most critical challenges in drug development, causing patient safety concerns, clinical trial failures and drug withdrawals. We introduce ToxPredictor, a toxicogenomics framework combining RNA-seq data from primary human hepatocytes with pharmacokinetic data to predict dose-resolved DILI risks and safety margins. At its core is DILImap, an RNA-seq library tailored for DILI research, comprising 300 compounds at multiple concentrations. ToxPredictor achieves 88% sensitivity at 100% specificity in blind validation, outperforming state-of-the-art methods. It flagged recent phase III clinical failures, including Evobrutinib, TAK-875, and BMS-986142, overlooked by animal studies. Beyond prediction, ToxPredictor provides mechanistic insights into hepatotoxic pathways, enabling early de-risking and actionable safety decisions. Unlike single-endpoint readouts—even from 3D models—transcriptomics offers a multi-dimensional system-level view of hepatocyte responses, capable of detecting diverse DILI mechanisms not captured by conventional assays. Scalable, actionable, and integrated into a broader AI/ML drug discovery platform, this work establishes toxicogenomics as a promising tool for developing safer therapeutics and addressing one of the most pressing challenges in toxicology.

Introduction

Drug-Induced Liver Injury (DILI) presents a poorly understood late-stage challenge in drug development, costing an estimated $350 million annually per pharmaceutical company¹. Its rarity and unpredictability in clinical populations, often less than 1 in 10,000 persons, hinder detection in clinical studies, masking its severity until post-market exposure. Animal models fail to identify about half of the pharmaceuticals that exhibit clinical DILI². This makes DILI a leading cause of drug candidate failure and market withdrawal, impeding the development of new therapies³.

DILI arises from complex, multifactorial mechanisms, involving both dose-dependent intrinsic mechanisms and idiosyncratic reactions that current methods cannot predict^4,5 and that are influenced by genetic predisposition, environment, and individual health status⁶. DILI involves various cellular disruptions including mitochondrial dysfunction, oxidative stress, bile acid imbalance, inhibition of specific enzymes or transporters, and reactive metabolite formation^7,8. However, the precise mechanisms and contributing factors are not fully delineated⁴ and the lack of biomarkers hampers early detection⁹. Pre-clinical methods, such as quantitative structure-activity relationship (QSAR) models, offer low specificity and binary predictions lacking mechanistic insights^10,11. In vitro models use diverse cell sources (e.g., HepG2, THLE, HepaRG cells, primary human hepatocytes) with endpoints ranging from cytotoxicity markers (e.g., LDH, ATP) to mechanistic assessments using high-content imaging (HCI) and multi-parametric strategies. 3D liver models aim to better mimic human tissue characteristics and in vivo responses¹². However, despite improved physiological relevance, these models remain constrained by low-dimensional readouts (typically a limited panel of markers such as ATP levels, LDH release, or imaging-based features), which fail to capture the full spectrum of molecular responses and often miss the mechanistic cause. This gap continues to result in late-stage drug withdrawals and clinical failures, highlighting the urgent need for more comprehensive and predictive DILI models⁹.

Inspired by the idea of viewing cells as complex systems, we adopt a machine learning-driven toxicogenomics approach to analyze how DILI-associated compounds alter gene expression, identifying early gene signatures indicative of liver injury. Encompassing the interplay of pathways in response to toxic compounds, we aim to decipher the molecular mechanisms underlying DILI. Utilizing resources like DILIrank¹³ and LiverTox¹⁴ for drug categorization into DILI positives and negatives, we employed the TG-GATES¹⁵ microarray database for an initial proof-of-concept, enabling us to accurately predict DILI with 62% sensitivity and 92% specificity in blind validation. Building on this initial success, we created DILImap, a purpose-built and significantly expanded transcriptomic library designed to capture a broader spectrum of DILI mechanisms. DILImap features full-transcriptome RNA-seq data from 300 compounds profiled at multiple concentrations in primary human hepatocytes (PHHs), making it, to the best of our knowledge, the largest toxicogenomics dataset available for DILI modeling.

ToxPredictor, our random forest-based machine learning model trained on DILImap, achieves 88% sensitivity (29/33 DILI positives) at 100% specificity (14/14 DILI negatives) in blind validation. It outperforms 20+ pre-clinical models in head-to-head comparisons, including mechanistic assays^{16,17,18,19,20,21}, cytotoxicity markers^{22,23,24,25,26,27}, physicochemical properties²⁸, bioactivation²⁹, BSEP³⁰ approaches, and the latest in-silico models^31,32,33, effectively identifying DILI risks in drugs previously overlooked by traditional models. To the best of our knowledge, it is the first pre-clinical model to flag DILI risks in high-profile clinical failures, such as Evobrutinib, TAK-875 and BMS-986142, all recently withdrawn in phase III trials due to liver injury despite clean preclinical profiles. The model provides dose-resolved predictions and mechanistic insights, demonstrating its utility for prioritizing safer drug candidates.

Beyond generalization to unseen compounds, the model has a distinct edge in its mechanistic breadth. The model leverages the full transcriptomic landscape to detect DILI-related mechanisms such as mitochondrial dysfunction, oxidative stress, immune activation, and metabolic perturbation—often well before cell death. These advantages are particularly evident when compared to high-content 3D liver assays^26,34,35,36, which, while physiologically relevant, are typically constrained to low-dimensional viability or imaging endpoints. In head-to-head comparisons, our model uniquely identifies non-cytotoxic risks missed by 3D assays. This systems-level resolution enables more comprehensive and unbiased detection of toxic liabilities across diverse compound classes.

This work, integral to a broader AI/ML drug discovery platform, aims at enhancing predictive power and operational efficiency in drug development. It shows that the shift from a single-target to a systems-level perspective holds great promise and positions machine learning in toxicogenomics as a significant enhancement to existing methods. To advance the field and foster collaborative innovation, we have made our open-source model and validation data publicly available at dilimap.org, providing a powerful tool for de-risking drug candidates and setting the stage for a paradigm shift in safety evaluations.

Results

DILImap – a human toxicogenomics database for DILI modeling

We have created DILImap, a comprehensive RNA-seq library tailored for drug-induced liver injury (DILI) modeling, encompassing 300 compounds tested at four concentrations. As the most extensive toxicogenomics resource to date, DILImap includes a curated selection of DILI-positive and DILI-negative compounds that span a wide range of known DILI mechanisms, including well-documented liver-injuring drugs and idiosyncratic compounds with no characteristic signature (Fig. 1A).

**Fig. 1: DILImap enables accurate prediction of drug-induced liver injury (DILI).**

All compounds were screened in sandwich-cultured primary human hepatocytes (PHHs), the gold standard and most physiologically relevant in vitro model for liver toxicity, which preserve key hepatic functions such as metabolic activity and bile canaliculi formation^37,38. Each compound was tested in triplicate across six concentrations using lactate dehydrogenase (LDH) and Adenosine Triphosphate (ATP) cell viability assays. RNA-seq profiling was performed at four selected doses, spanning the pharmacologically relevant range from therapeutic plasma Cmax to the highest tolerated non-cytotoxic dose just below the IC₁₀ threshold (Supplementary Fig. S1).

We selected a 24-hour post-exposure time point, based on the trade-off between signal strength and cellular viability: earlier time points (e.g., 2 h or 8 h) yield weaker transcriptional responses³⁹, while longer incubations risk hepatocyte de-differentiation and RNA degradation⁴⁰. This strategy allowed us to capture early transcriptional responses without compromising RNA integrity. Marker analysis confirmed retention of hepatocyte identity at 24 hours. We further ensured data quality by including only wells with sufficient total RNA counts and low mitochondrial RNA content, indicating viable, transcriptionally active cells. This streamlined workflow—including automated solubility testing, viability screening, and IC₁₀-based dose selection—enabled us to profile 300 compounds in four months, including 110 drugs tested preclinically for the first time as part of a systematic benchmark (Supplementary Fig. S2; Methods). To support comprehensive benchmarking, we provide detailed annotations for each compound, including clinical DILI labels¹³, DILI mechanisms¹⁴, molecular information⁴¹, consensus plasma Cmax from various studies^{10,19,20,23,24,25,28,42,43}, and DILI classification results from over 20 pre-clinical studies (Supplementary Data S1–S4).

Compounds were categorized based on DILIrank¹³ and LiverTox¹⁴ as follows (Supplementary Fig. S3):

Withdrawn DILI: withdrawals or clinical trial failures due to DILI (Most-DILI-Concern; withdrawn).
Known DILI: compounds with well-established clinical DILI risk (Most-DILI-Concern or LiverTox score A).
Likely DILI: drugs with documented liver injury cases (Most-DILI-Concern or LiverTox score A/B).
Idiosyncratic DILI: rare cases without clear dose–response (LiverTox score C/D, <12 case reports).
Unlikely DILI: discordant or weak evidence across databases (Less-DILI-concern, but LiverTox score E).
No DILI: compounds with no documented hepatotoxicity (No-DILI-Concern; LiverTox score E).

Withdrawn, Known, and Likely DILI serve as positive controls; No DILI as negative controls; while Idiosyncratic and Unlikely DILI, due to their label ambiguity, are excluded from training and reserved for downstream testing.

The training dataset includes 249 compounds (111 DILI+, 52 DILI−, 17 unlikely DILI, 69 idiosyncratic DILI) for cross-validation. For blind validation, a separate experiment was conducted using an independent set of 51 compounds (33 DILI+, 14 DILI−, and 4 with unknown labels, including real-world clinical failures). This carefully curated dataset provides a robust foundation for predictive modeling and mechanistic insights into DILI.

ToxPredictor – pathway-level toxicogenomics predicts DILI risk and therapeutic safety margins

ToxPredictor, a machine learning model trained on our DILImap library, predicts DILI risk from pathway-level transcriptional signatures. These signatures are derived through enrichment analysis (WikiPathways⁴⁴, FDR-adjusted p-values) of genes differentially expressed between compound- vs. DMSO-treated samples using DESeq2⁴⁵, computed for each dose of every compound in DILImap (Fig. 1B; see Methods).
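The enrichment step can be illustrated with a small sketch. The exact enrichment statistic is described in the Methods and is not restated here, so the code below assumes a common choice: a hypergeometric over-representation test per pathway followed by Benjamini-Hochberg FDR adjustment. The function names, the input layout (a list of differentially expressed genes, a dictionary of pathway gene sets, and a background gene list) and the toy gene identifiers are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n belong to the pathway."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def pathway_enrichment(de_genes, pathways, background):
    """Map each pathway name to an FDR-adjusted over-representation p-value."""
    universe = set(background)
    de = set(de_genes) & universe
    names = sorted(pathways)
    pvals = []
    for name in names:
        members = set(pathways[name]) & universe
        overlap = len(de & members)
        pvals.append(hypergeom_sf(overlap, len(universe), len(members), len(de)))
    return dict(zip(names, benjamini_hochberg(pvals)))


# Toy example with made-up gene identifiers (not data from the study).
background = [f"g{i}" for i in range(1, 11)]
pathways = {"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g8", "g9"]}
result = pathway_enrichment(["g1", "g2", "g3"], pathways, background)
```

In this sketch, one vector of adjusted p-values per compound and dose would serve as the pathway-level signature passed to the classifier; how the published pipeline transforms these values is an assumption left open here.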

For model training, we used only compounds with unambiguous DILI labels, resulting in a high-confidence training set of 111 DILI+ (Withdrawn, Known, Likely) and 52 DILI− (No DILI), while the remaining training data were held out to assess model robustness. To ensure high-confidence DILI labels, we further restricted training to drug concentrations above 20× the clinical Cmax to reduce the risk of false-negative labeling for DILI+ compounds that may appear safe at lower doses. For 5-fold cross-validation, we applied stratified, compound-level splitting to ensure that all doses and replicates of a given compound were held out together in each fold, mimicking real-world generalization to unseen compounds. From 193 tested configurations across eight model classes, we selected a Random Forest classifier for its strong validation AUC, minimal overfitting, and highest consistency across folds. These properties, combined with its interpretability, motivated its choice over more complex boosting and deep learning models (Supplementary Fig. S4).
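The compound-level splitting described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name, the round-robin assignment and the toy compound names are assumptions. The key property is that fold membership is decided per compound, so every dose and replicate of a compound lands in the same fold, while positives and negatives are spread across folds.

```python
import random
from collections import defaultdict


def compound_level_folds(sample_compounds, compound_labels, n_splits=5, seed=0):
    """Return one fold index per sample.

    sample_compounds: compound name for each sample (dose/replicate).
    compound_labels:  {compound name: DILI label}, one label per compound.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for compound in sorted(set(sample_compounds)):
        by_label[compound_labels[compound]].append(compound)

    fold_of = {}
    for label in sorted(by_label):
        compounds = by_label[label]
        rng.shuffle(compounds)
        # Round-robin within each label keeps the class balance similar per fold.
        for i, compound in enumerate(compounds):
            fold_of[compound] = i % n_splits

    # Every sample inherits the fold of its compound.
    return [fold_of[c] for c in sample_compounds]


# Toy data: 10 hypothetical compounds, 4 doses each, half labeled DILI+.
compounds = [f"cmpd{i}" for i in range(10)]
labels = {c: int(i < 5) for i, c in enumerate(compounds)}
samples = [c for c in compounds for _ in range(4)]
folds = compound_level_folds(samples, labels)
```

Splitting at the sample level instead would let doses of the same compound appear in both training and held-out folds, which would overstate how well the model generalizes to unseen compounds.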

The final model is an ensemble of 30 Random Forest models (ensemble members) trained on different cross-validation splits, which together enhance generalization and prediction stability. The ensemble size was chosen based on empirical benchmarking that showed stable test AUC and consistency between models (Supplementary Fig. S5). By estimating DILI probabilities across dose levels, ToxPredictor enables calculation of drug safety margins, defined as the ratio between the first predicted DILI dose (i.e., the lowest dose with predicted probability >0.7) and the maximum plasma concentration (Cmax) at therapeutic levels. This provides a transcriptomics-based surrogate of the clinical therapeutic window. A safety margin threshold of 80 provides an actionable classification into high- and low-risk compounds. The probability threshold of 0.7 and margin of safety (MOS) cutoff of 80 were both optimized on held-out training data to reach performance plateaus while minimizing false positives. Our selected MOS threshold of 80, while on the higher end of literature-reported ranges (10–100)^{17,23,28,34,36}, reflects the greater sensitivity of transcriptomic assays compared to cytotoxicity or mechanistic readouts. Since transcriptional changes often occur at lower doses—before overt toxicity—a higher cutoff is needed to avoid false positives and maintain high specificity in a transcriptome-based model (Supplementary Fig. S5). Among available exposure measures, we used total Cmax instead of free Cmax due to its broader availability across compounds. Both measures showed comparable predictive performance, with total Cmax performing slightly better, possibly due to more robust consensus values derived from a greater number of studies (Supplementary Fig. S6).
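The safety margin definition above reduces to a short calculation, sketched here under stated assumptions. The function names and the dictionary input are hypothetical; doses and Cmax are assumed to share the same concentration units; and treating a margin exactly equal to the cutoff as high risk is an assumption about boundary handling. The example probabilities are invented for illustration.

```python
def margin_of_safety(dose_probs, cmax, prob_threshold=0.7):
    """Ratio of the first predicted DILI dose to the therapeutic Cmax.

    dose_probs: {dose: predicted DILI probability}, doses in the units of cmax.
    Returns None when no tested dose exceeds the probability threshold.
    """
    flagged = [dose for dose, prob in dose_probs.items() if prob > prob_threshold]
    if not flagged:
        return None
    return min(flagged) / cmax


def is_high_risk(mos, mos_cutoff=80):
    """Flag a compound whose margin of safety does not exceed the cutoff."""
    return mos is not None and mos <= mos_cutoff


# Invented example: probabilities at four doses (in uM) for a drug with Cmax = 2 uM.
probs = {1: 0.10, 10: 0.40, 100: 0.80, 1000: 0.95}
mos = margin_of_safety(probs, cmax=2)  # first flagged dose 100 uM, so 100 / 2
flag = is_high_risk(mos)
```

A compound with no flagged dose has no finite margin within the tested range, which is why the sketch returns `None` instead of a number.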

All model selection, hyperparameter tuning, and threshold optimization were performed exclusively on the training data. For final evaluation, we used a fully independent blind-validation set of 51 compounds (33 DILI+, 14 DILI−, and 4 unknowns), profiled in a separate experiment using separate plates and sequencing runs. This set was withheld from all stages of model development. Compound selection for the validation study was finalized prior to training and intentionally enriched for withdrawals and recent clinical failures. The four unknowns represent compounds currently in clinical use or trials without confirmed DILI liability (Supplementary Table S1).

In blind validation, the model achieved 88% sensitivity, correctly identifying 29 of the 33 DILI+ compounds, and 100% specificity, with all 14 DILI− compounds correctly classified as safe (Fig. 1C).
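As a worked check of these figures, the short sketch below recomputes sensitivity and specificity from the reported counts, together with balanced accuracy (the mean of the two), which is the metric used for the benchmark comparisons later in the article. The helper names are illustrative only.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of DILI+ compounds that were flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of DILI- compounds that were classified as safe."""
    return true_neg / (true_neg + false_pos)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity."""
    return (sens + spec) / 2


# Blind validation counts from the text: 29 of 33 DILI+ and 14 of 14 DILI-.
sens = sensitivity(true_pos=29, false_neg=4)
spec = specificity(true_neg=14, false_pos=0)
bal_acc = balanced_accuracy(sens, spec)
```

Here `sens` is 29/33, which rounds to the reported 88%, and `spec` is exactly 1.0.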

Identifying withdrawn and idiosyncratic DILIs previously overlooked in animal and clinical studies

These results represent a substantial improvement over our initial proof-of-concept model trained on TG-GATES microarray data, which achieved 62% sensitivity and 92% specificity. Leveraging our DILImap library, ToxPredictor improved both sensitivity (to 88%) and specificity (to 100%) on the same validation set (Fig. 2A). In cross-validation of the entire library, the model identified 110 out of 144 DILIs, surpassing the previous 62 out of 144 with TG-GATES, and misclassified only 8 out of 66 non-DILIs (Fig. 2B, Supplementary Fig. S7). This enhancement is attributed to DILImap’s larger dataset with broader mechanistic coverage and to the higher resolution of RNA-seq, which enables better gene detection and a wider quantitative range for expression level changes than microarrays⁴⁶. Notably, post-market withdrawals, missed in both pre-clinical models and clinical trials, were most confidently flagged by our model as high DILI risk.

**Fig. 2: Transcriptomics-derived safety margins accurately identify DILI risk, including withdrawn compounds.**

Our DILI safety margin and classification are derived from three parameters: Cmax (baseline concentration), cell viability assays (indicating cell death), and transcriptomics (based on differential pathways). To assess each parameter’s contribution, we evaluated their ability to classify DILI cases independently. Out of 144 DILIs, 29 were detected solely based on plasma Cmax (>25 μM) and 42 through the LDH cytotoxicity assay (safety margin <80), both at ≥90% specificity. Combining our transcriptomics-based model with Cmax and LDH data was most effective, identifying 110 out of 144 DILIs (safety margin <80), underscoring the added value of toxicogenomics in DILI detection beyond mere cell death (Supplementary Fig. S8). ToxPredictor achieved a ROC AUC of 0.82 in cross-validation, compared to 0.66 for viability alone. In blind validation, it achieved a ROC AUC of 0.96, compared to 0.65 for viability alone (Supplementary Fig. S9).

Mechanistic dissection of NSAIDs with shared targets but different DILI profiles

Our model highlights distinct DILI profiles among closely related COX-2 inhibitor non-steroidal anti-inflammatory drugs (NSAIDs) and imparts unique mechanistic insights linking predictions to mechanisms such as hepatocellular injury, oxidative stress, and mitochondrial dysfunction. For instance, Valdecoxib, used for cancer pain, shows no DILI risk (Fig. 3A), while Sulindac, an arthritis treatment with rare but established idiosyncratic DILI cases, and Lumiracoxib, withdrawn due to severe liver failures, are flagged as DILI risks with safety margins below the classification threshold of 80 (Fig. 3B).

**Fig. 3: Three COX-2 inhibitors with the same target but distinct DILI profiles.**

Crucially, the model highlights the pathways implicated in DILI, encompassing direct contributors like oxidative stress leading to cell injury, as well as indirect factors such as disturbances in fatty acid metabolism, which can be particularly relevant to explain idiosyncratic effects (Supplementary Table S2). Sulindac, for example, is linked to disruptions in fatty acid synthesis and cholesterol biosynthesis, aligning with recent studies connecting it to hepatic steatosis⁴⁷. By pinpointing these pathways, the model provides mechanistic insights into idiosyncratic DILI, a form of injury previously thought unpredictable (Fig. 3C).

The model assesses DILI risks in a dose-resolved manner, revealing how dosage impacts liver injury likelihood. A key demonstration is its accurate prediction of Sulindac’s DILI risk, despite it being believed to be unpredictable in a dose-resolved manner⁴⁸. These results highlight the model’s ability to deliver actionable predictions and enable targeted optimization of drug safety profiles by focusing on critical pathways (Fig. 3D).

Known and novel genes and pathways associated with DILI risk

DILI arises from disruptions in diverse pathways. Our model highlights pathways with high predictive value (AUC ≈ 0.8) strongly associated with DILI risk, including amino acid metabolism (toxic metabolite buildup causing oxidative stress and liver injury), fatty acid biosynthesis (disruptions leading to lipid accumulation and hepatocyte damage), tryptophan metabolism (toxic intermediates driving oxidative stress and inflammation) and ferroptosis (iron-dependent oxidative stress leading to lipid peroxide accumulation)^{49,50,51,52,53}. Additionally, pathway activations highly correlated with predicted DILI risk include nuclear receptor signaling (e.g., PXR/CAR/FXR), one-carbon metabolism, and bile acid regulation—highlighting transcriptional reprogramming and metabolic stress as key contributors to hepatotoxicity^54,55,56 (Fig. 4A). When compounds are ranked by predicted DILI probabilities, a clear gradient of pathway activation emerges, revealing distinct enrichment patterns for these biological processes. This correlation reinforces the direct mechanistic relevance and interpretability of the model’s predictions and highlights these pathways as potential drivers of DILI (Fig. 4B).

**Fig. 4: Key pathway activations and genes frequently implicated in DILI.**

To identify genes most significant for DILI, we determined the frequency at which each gene was differentially up- or downregulated across DILI drugs in our library, using an adjusted p-value threshold of 0.05. This analysis focused on the concentrations at which toxic effects were first predicted, aiming to uncover early upstream regulators potentially driving DILI. Most frequently up-regulated genes were associated with drug metabolism, transport, stress response, and lipid metabolism. Novel genes linked to inflammation, autophagy, and mitochondrial dysfunction were also implicated (Fig. 4C). Frequently down-regulated genes include those critical for liver functions such as drug metabolism, transport, lipid metabolism, amino acid metabolism, mitochondrial function, coagulation and inflammatory responses. Altered expression of these genes may serve as early indicators of liver injury and reflect DILI’s multifaceted mechanisms (Fig. 4D).
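The gene-frequency analysis amounts to counting, per gene, how many DILI drugs show significant up- or downregulation at the first predicted toxic concentration. A minimal sketch follows; the input layout (per-compound dictionaries of log2 fold change and adjusted p-value), the function name and the toy gene and drug names are assumptions made for illustration, not the authors' data structures.

```python
from collections import Counter


def regulation_frequency(de_results, alpha=0.05):
    """Count, per gene, the compounds with significant up- or downregulation.

    de_results: {compound: {gene: (log2_fold_change, adjusted_p_value)}},
    taken at the first dose predicted as toxic for each compound.
    """
    up, down = Counter(), Counter()
    for genes in de_results.values():
        for gene, (log2fc, padj) in genes.items():
            if padj >= alpha:
                continue  # not significant at the chosen threshold
            if log2fc > 0:
                up[gene] += 1
            elif log2fc < 0:
                down[gene] += 1
    return up, down


# Invented values for two hypothetical drugs and two placeholder genes.
de = {
    "drugA": {"GENE_1": (2.0, 0.01), "GENE_2": (-1.5, 0.20)},
    "drugB": {"GENE_1": (1.2, 0.03), "GENE_2": (-2.0, 0.001)},
}
up_counts, down_counts = regulation_frequency(de)
```

Dividing each count by the number of compounds gives the frequency used for ranking, and `up_counts.most_common(n)` returns the top genes.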

Establishing that the DILI pathways and genes identified by our model are specific to liver toxicity rather than general toxicity is inherently challenging. However, the model’s precision is evident in its accurate classification of non-DILI compounds with known toxicities in other systems, such as Valdecoxib (cardiovascular toxicity), Bupropion (neurologic and cardiovascular toxicity), and Warfarin (hematologic toxicity)^57,58,59, indicating the model’s ability to distinguish liver-specific toxicity from other forms of organ damage.

ToxPredictor accurately flags DILI risks in recent clinical failures and provides dose recommendations

Bruton tyrosine kinase (BTK) inhibitors, despite their promise in oncology and autoimmune diseases, have faced clinical holds due to liver injury. Recent examples include Evobrutinib, BMS-986142, and Orelabrutinib, all of which were withdrawn or put on hold in phase III in 2023 due to DILI cases.

We validated ToxPredictor on four clinical failures: Evobrutinib, BMS-986142, Orelabrutinib, and TAK-875 (type 2 diabetes drug), along with two investigational BTK inhibitors (Rilzabrutinib, Remibrutinib) and two FDA-approved JAK inhibitors (Tofacitinib, Upadacitinib) as negative controls. DILI risk probabilities were assessed at four concentrations, and margins of safety (MOS) were estimated to classify compounds as high (MOS ≤ 2.5), mid-high (MOS ≤ 12.5), medium (MOS ≤ 80), or low risk (MOS > 80). All clinical failures were flagged as high or mid-high risk with low MOS values, particularly TAK-875, Evobrutinib, and BMS-986142, consistent with their phase III withdrawals. The two investigational drugs, which have not yet been linked to DILI in clinical studies, were classified as medium risk (MOS = 14) and low risk (MOS = 101), while the DILI-negative JAK inhibitors were classified as low risk. These results align closely with their clinical safety profiles (Fig. 5A).
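The four risk tiers map directly onto MOS cutoffs, as the sketch below shows. The function name is hypothetical and the cutoffs are those stated in the text; compounds with no flagged dose (no finite MOS) are outside the scope of this sketch.

```python
def risk_tier(mos):
    """Map a margin of safety (MOS) to the risk tiers described in the text."""
    if mos <= 2.5:
        return "high"
    if mos <= 12.5:
        return "mid-high"
    if mos <= 80:
        return "medium"
    return "low"


# The two MOS values reported for the investigational BTK inhibitors.
tiers = [risk_tier(14), risk_tier(101)]
```

With these cutoffs, MOS = 14 falls in the medium tier and MOS = 101 in the low tier, matching the classifications reported above.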

**Fig. 5: Real-world applicability in flagging DILI risk in recent clinical failures.**

ToxPredictor provides dose-dependent DILI risk curves derived from empirical DILI likelihoods across various hypothetical Cmax values, enabling safe dosing recommendations (see Methods). For instance, Rilzabrutinib is categorized as low risk at doses below 100 mg q.d., which is lower than its efficacious dose of 400 mg. In contrast, Remibrutinib’s efficacious dose of 100 mg falls within the recommended low-risk range of <155 mg q.d. These findings highlight ToxPredictor’s value in informing safe dosing strategies, making it a valuable tool for de-risking new drug candidates (Fig. 5B).

Mapping toxicogenomics in the competitive landscape of existing pre-clinical DILI models

We benchmark our model along two key axes: predictive performance and scalability. Predictive performance, measured by balanced accuracy, reflects the model’s ability to distinguish DILI-positive from DILI-negative compounds. Scalability captures both technical throughput and biological breadth—the capacity to generalize across diverse chemistries and mechanisms, including previously uncharacterized ones (Fig. 6A).

**Fig. 6: ToxPredictor outperforms state-of-the-art prediction models in accuracy and scalability.**

Our model outperforms a wide range of pre-clinical DILI models, including mechanistic assays^{16,17,18,19,20,21}, cytotoxicity markers^{22,23,24,25,26,27}, physicochemical properties²⁸, bioactivation²⁹ and BSEP³⁰ approaches. In a head-to-head comparison across matched compound sets, it identified 46 out of 66 DILI cases versus the 27 out of 66 identified by Xu et al.¹⁷ HCI assay (49/66 vs 27/66); it shows superior performance over Garside et al.²⁰ HCI assay (37/46 vs 29/46 DILIs), Vorrink et al.²⁶ cytotoxicity assay using CD spheroids (37/43 vs 30/43 DILIs), Sakatis et al.²⁹ bioactivation endpoint GSH adduct (47/65 vs 25/65) as well as their combined assay integrating covalent binding and dose (47/65 vs 32/65). When compared to Kohonen et al.’s transcriptomics-based cytotoxicity model⁶⁰, our approach showed improved sensitivity (26/36 vs. 16/36 DILIs). These comparisons, all at 100% specificity evaluated on the same compounds, underscore the added value of our systems-level, mechanism-agnostic readout (Fig. 6B; Supplementary Table S3).

Structure-based in silico models such as TxGemma³¹, DILIGeNN³² and DILIPredictor³³ underperform in vitro-based approaches in our benchmark. To assess real-world generalizability, we evaluated them on 314 independent compounds (45 DILI+, 269 DILI−) primarily annotated via LiverTox (scores A/B as DILI+, E as DILI−); TxGemma was also tested on an expanded set (143 DILI+, 536 DILI−). All showed limited specificity: DILIGeNN (84% sensitivity, 28% specificity), DILIPredictor (80% sensitivity, 29% specificity), and TxGemma-27B (57% sensitivity, 37% specificity). These findings are slightly below the balanced accuracy of 0.59 reported by Seal et al. (2024)³³ for DILIPredictor. On a benchmark subset of unseen compounds overlapping with DILImap (n = 97), TxGemma reached 63% sensitivity (39/62) and 57% specificity (20/35), while our model achieved 76% sensitivity (47/62) and 86% specificity (30/35). Similarly, DILIGeNN showed perfect sensitivity (5/5) at moderate specificity (2/3), while our model reached 100% on both (5/5 and 3/3). DILIPredictor reached complete sensitivity (23/23) but at the expense of poor specificity (1/7), while ToxPredictor maintained high sensitivity (20/23) at markedly higher specificity (5/7). Low specificity is a key limitation of structure-based models, which lack biological context and tend to over-call toxicity. This results in false positives for commonly prescribed drugs with no risk of hepatotoxicity, such as biotin (flagged DILI+ by DILIPredictor), vitamin D (flagged DILI+ by DILIGeNN), and pemetrexed (flagged DILI+ by all three models). Moreover, they provide only binary outputs, without dose or mechanistic insight. In contrast, transcriptomics enables dose-resolved predictions, mechanistic interpretability, and safety margin estimation, all of which are critical for evaluating toxic liabilities and guiding follow-up experiments (Supplementary Fig. S10; Supplementary Data S4).

3D liver systems offer important physiological context. High-content imaging in 3D models, such as those by Walker et al.³⁴ and Ewart et al.³⁵, achieves similar performance on small, curated panels (Walker: 23/27 vs. 23/27; Ewart: 11/14 vs. 13/14). However, their limited scalability restricts their applicability to a broader range of DILI mechanisms. They may perform well on narrow, curated panels, but struggle with unknown mechanisms or mechanisms not captured by the low-dimensional endpoint, as shown in the following comparison. To explore the unique capabilities of 2D transcriptomics vs 3D cytotoxicity assays, we conducted direct compound-level comparisons with larger 3D screening studies: Vorrink et al.²⁶ and Fäs et al.³⁶ In the Vorrink et al. study, 3D cytotoxicity uniquely detected 3 compounds (Fialuridine, Methotrexate, Trazodone), all linked to cytotoxic effects that result in acute cell death. Conversely, our model uniquely flagged 10 compounds, including Fluconazole, Phenytoin, and Zileuton, associated with immune activation, metabolic stress, or enzyme modulation, which are not readily captured by viability endpoints. A similar pattern emerged in the Fäs et al. study: 3D cytotoxicity exclusively identified 4 compounds (e.g., Haloperidol, Fialuridine) whose toxicities depend on structural or metabolic context. Our model uniquely identified 5 compounds (e.g., Cimetidine, Fluconazole, Ximelagatran) marked by subtle transcriptional responses rather than overt cell death. These comparisons highlight a key limitation of fixed single-endpoint models: while effective in narrow contexts, they struggle with broader chemical and mechanistic diversity. Our transcriptomic approach, by contrast, offers systems-level resolution that generalizes across DILI pathways, not only detecting known cytotoxic responses but also uncovering less immediate, non-lethal mechanisms often missed by traditional assays (Supplementary Data S4).

As a result of its unbiased modeling, our approach shows improved detection of idiosyncratic compounds, a class of toxicities that often escape detection in targeted or phenotypically narrow assays. These compounds, many of which are associated with extremely rare clinical incidence (<12 case reports), present a significant challenge for preclinical screening. Our model identified 29 out of 65 of those cases (44%), the highest detection rate among all evaluated models, while maintaining a specificity of 88% (Supplementary Fig. S11A).

Next, we analyzed how combining toxicogenomics with orthogonal assays further enhances detection. The three most effective combinations are pairing our model with Walker et al.’s 3D-based HCI assay³⁴ to improve DILI detection from 23/27 to 26/27 cases, pairing with Persson et al.’s 2D-based HCI assay¹⁹ to improve DILI detection from 28/37 to 30/37 cases, and pairing with Sakatis et al.’s GSH depletion assay²⁹ to increase detection from 47/65 to 53/65 cases. Such strategic combinations could raise balanced accuracy to as high as 98% (Supplementary Fig. S11B).

Based on these insights, we propose a tiered de-risking funnel strategy that begins with straightforward endpoints, such as PK data (e.g., Cmax <25 μM) and cytotoxicity assays, to flag overt hepatotoxicity. For candidates showing no early toxicity signals, toxicogenomics provides the most comprehensive and unbiased assessment of DILI risk—capturing both known and novel mechanisms. For a select few advanced candidates, with sufficient resources, applying toxicogenomics in advanced 3D liver models may offer the most accurate prediction of in vivo responses. This strategy ensures a resource-efficient and mechanistically broad DILI risk assessment in drug development.

Discussion

Our toxicogenomics approach offers a comprehensive and unbiased perspective on cellular responses, providing rich information for a nuanced understanding of liver toxicity, including idiosyncratic reactions with unknown mechanisms. By shifting from single-target analyses to a systems-level viewpoint, we demonstrate that applying machine learning to toxicogenomics holds great promise as a significant enhancement over existing toxicology methods. It enables dose-specific predictions and safety margins, moving beyond binary DILI/No-DILI assessments. Its extensive mechanistic coverage presents a substantial advantage over current pre-clinical models. By creating a comprehensive toxicogenomics library specifically designed for DILI research, we achieved notable improvements in predicting DILI risks compared to state-of-the-art methods. The high sensitivity (88%) and specificity (100%) obtained in blind validation, along with the identification of DILI in withdrawn drugs and clinical failures overlooked in animal and clinical studies, underscore its practical utility in early drug development phases. We consider this an important step forward in predictive toxicology, positioning our approach as a significant advancement in the field. The scalability and adaptability of our method, a central component of a larger AI/ML platform, are designed to enhance the predictive power and efficiency of drug development pipelines.

While our approach represents a meaningful step forward, it is important to acknowledge the inherent limitations of our method. First, the multifactorial nature of DILI, involving genetic, environmental, and lifestyle factors, means even the most advanced models cannot capture the full spectrum of potential mechanisms. Our approach, though comprehensive, does not account for all interindividual variability or rare genetic predispositions contributing to DILI risk. Second, the predictive power of our model is constrained by the completeness of the DILImap library. Gaps in the database, particularly in relation to poorly documented idiosyncratic reactions, can limit the model’s accuracy. Third, our reliance on a 2D hepatocyte culture system may fail to replicate the complex interactions between hepatocytes and other cell types or tissues that can drive certain DILI mechanisms. Fourth, the exposure timepoint of 24 hours limits the detection of delayed or immune-mediated toxicities. Fifth, as with any preclinical model, the ultimate test of its utility lies in its ability to predict clinical outcomes, a domain where uncertainties and unpredicted variables can significantly impact performance.

Looking ahead, toxicogenomics holds tremendous promise for advancing our understanding and prediction of DILI. Integrating RNA-seq with advanced 3D culture systems, such as spheroids and liver-on-chip platforms, may enable longer drug exposures and the inclusion of non-parenchymal cells like Kupffer cells and hepatic stellate cells. These co-culture models are essential for capturing immune activation, inflammation, and fibrosis, which are hallmarks of idiosyncratic and chronic DILI that hepatocyte monocultures cannot recapitulate. As long as 3D models rely on high-content imaging or cytotoxicity endpoints that capture specific phenotypic responses, they remain limited in mechanistic scope and scalability. Their performance is often demonstrated on small, curated compound sets with known mechanisms, which may not translate to broader chemical space. In contrast, transcriptomics provides a scalable, unbiased readout capable of detecting diverse DILI mechanisms, including those not captured by existing assays. As RNA-seq becomes more feasible in physiologically relevant 3D systems, we anticipate a powerful synergy that combines the mechanistic breadth of transcriptomics with the physiological relevance of 3D models. Our work lays the foundation for such integration, demonstrating that transcriptomics alone can robustly capture DILI risk across a wide range of mechanisms.

Combining RNA-seq data with structural information of molecules could enable a deeper understanding of the interactions between drugs and cellular components, facilitating more accurate predictions of toxicity. Using early estimates, such as target activity, as a surrogate for plasma Cmax could help derive safety margins earlier in the drug discovery process. The development of multivariate models that include DILI regulatory networks represents another exciting frontier. Such models can incorporate the complex interplay of genes, proteins, and metabolic pathways involved in DILI. Furthermore, exploring additional data types, such as chromatin accessibility, proteomics, or metabolomics, could yield further insights into DILI mechanisms.

As these advanced models and datasets become integrated into the early stages of drug development, we anticipate a decrease in liver-related adverse events, improved efficiency in drug development, significant cost savings, and, most importantly, enhanced patient safety. The ongoing evolution of toxicogenomics approaches, bolstered by advancements in computational machine learning methods and multi-omics technologies, marks an important step toward more predictive drug safety evaluation—one that has the potential to support the development of safer therapeutics and improved patient outcomes.

Methods

Experimental workflow

We employed a systematic, high-throughput approach using cryopreserved primary human hepatocytes (PHHs) to study transcriptional changes underlying drug-induced liver injury (DILI).

Human tissue sourcing and ethical compliance

Cryopreserved PHHs were obtained from LifeNet Health (Virginia Beach, VA, USA) under standard provider agreements. LifeNet Health procured tissues under informed donor consent and Institutional Review Board (IRB) approval in accordance with U.S. regulations. Cellarity did not create any new cell lines for this publication.

Cell culture overview

Cryopreserved PHHs were selected for high viability (>95%), long-term plateability (10–15 days), compatibility with 96-well formats, and Grade A quality. Donor 1917277-01, a 37-year-old Caucasian female (BMI 26), was used. Hepatocytes were cultured in collagen I-coated 96-well plates and maintained in a sandwich configuration (collagen base with a 100 μg/mL Matrigel overlay) to preserve hepatocyte function. Cells were matured for three days with daily media changes before compound treatment. After 24 h of treatment, cells were either assayed for viability (LDH/ATP) or lysed for RNA extraction and sequencing (Supplementary Fig. S1A).

Cell culture protocol

Human hepatocyte thawing media, human hepatocyte culture media, HHCM supplement, human hepatocyte plating media, HHPM supplement (LifeNet Health) and Pen/Strep (Gibco) were thawed and filtered before plating. PHHs were counted using a Luna-FL Cell Counter and Acridine Orange/Propidium Iodide Stain (Logos Biosystems). Cells were plated at a density of 0.5 million cells/mL in collagen I-coated 96-well plates (Gibco). Cells were left to attach in the incubator for 6 hours, after which the medium was replaced with maintenance medium (LifeNet Health). The next day, cells were overlaid with a thin coat of Matrigel and incubated for an additional day, with daily media changes, to allow for full maturation. On Day 3, cells were treated with compounds at multiple concentrations for 24 h and then either taken down for viability testing (LDH and ATP readouts) or lysed for RNA sequencing.

Experimental setup

The objective of this experiment was to screen 300 compounds to create a hepatotoxicity intervention library for building a predictive DILI model. PHHs were first screened for maximum tolerated dose (MTD) in six-point log curves (0.01 μM to 1 mM; Fig. S1B). MTD was defined as the highest concentration before observing >10% cell death in the LDH assay. RNA sequencing was conducted on cells treated with compound concentrations ranging from Cmax to MTD to evaluate the safety margin between therapeutic and toxic doses.
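
The MTD rule above can be illustrated with a minimal sketch. It assumes that LDH readouts have already been converted to percent cell death per concentration; the function name and the example values are hypothetical and not taken from the study.

```python
def maximum_tolerated_dose(concentrations_um, pct_cell_death, cutoff=10.0):
    """Highest concentration before cell death first exceeds `cutoff` percent.

    Returns None if even the lowest concentration exceeds the cutoff.
    """
    mtd = None
    # Walk the curve from the lowest to the highest concentration.
    for conc, death in sorted(zip(concentrations_um, pct_cell_death)):
        if death > cutoff:
            break
        mtd = conc
    return mtd


# Six-point log curve from 0.01 uM to 1 mM (1000 uM); death values are made up.
concs = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
death = [1.0, 2.0, 3.5, 6.0, 18.0, 55.0]
mtd = maximum_tolerated_dose(concs, death)
print(mtd)
```

In this made-up curve, cell death first exceeds 10% at 100 μM, so the sketch reports the preceding concentration as the MTD.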

Compound dissolution and plate preparation

All compounds were purchased from MedChemExpress (MCE). Compound preparation was performed in-house. Compounds were dissolved manually, fresh on the day of compound treatment, to avoid freeze/thaw cycles. In a previous DMSO tolerance test, 0.5% DMSO was determined to be the ideal concentration. Compound dissolution with DMSO started at a highest concentration of 1000 µM. If compounds were not solubilized at the 200x stock, dissolution was attempted at 100x and 50x concentrations. The compounds were added to a barcoded plate and transferred to the Hamilton MicroLab Star liquid handler for titration (Fig. S1B). The plate layout for titration varied between toxicity screens and RNA sequencing runs (Fig. S1C). Automation was used to prepare compound dilutions and to perform compound treatment.

Lactate dehydrogenase (LDH) viability assay

As the primary viability assay, we employed the non-radioactive cytotoxicity assay from Promega, which uses lactate dehydrogenase (LDH) as a marker for cell death. For reagent preparation, we thawed the LDH buffer at 4 °C overnight and reconstituted each LDH substrate bottle with 12 mL of buffer; the reconstituted substrate was then stored at –20 °C. We used designated untreated cells to establish the 100% lysate positive control, against which all treated cells were normalized for % viability calculations. To these designated wells, we added 10x Lysis buffer in a 1:10 volume ratio. We then mixed 50 μL of the collected sample with 50 μL of LDH substrate in a flat-bottom tissue culture plate. Plates were covered and incubated at room temperature for 30 min. Following the incubation, we added 50 μL of thawed Stop solution and read the plate on a SpectraMax i3x plate reader at an absorbance of 490 nm.

CellTiter-Glo (CTG) viability assay

As an orthogonal viability assay for toxicity screening, we used the CellTiter-Glo (CTG) Luminescent Cell Viability Assay (Promega), which determines the ATP content within the wells. The CTG viability assay is a terminal endpoint because the cells are lysed. To prepare the reagent, the CTG buffer and substrate were thawed at room temperature, the substrate was resuspended in 10 mL of the buffer, and the reagent was stored at –20 °C. CTG aliquots were thawed on the day of the takedown, and DPBS (Gibco) was added to the CTG substrate at a 1:1 ratio. Following the removal of the contents of the culture plate, 100 μL of the CTG and PBS mix was added to all wells of the plate. Plates were then covered with aluminum foil and set on an orbital shaker for 2 min at 400 rpm. After shaking, the plates remained covered and were left at room temperature for 10 min. The plates were then read on the SpectraMax i3x using the luminescence and Standard Opaque settings.

Cell lysis for RNA extraction

RLT buffer (Qiagen) and 2-mercaptoethanol (Thermo Fisher Scientific) were combined to create an RLT + 1% BME reagent. 140 μL of lysis buffer was added to all wells of the plate using a multichannel pipette. After confirming under a microscope that the cells were lysed, the 140 μL of lysate was transferred to an Eppendorf twin.tec 96-well PCR plate (Fisher Scientific) and placed in a −80 °C freezer.

Library preparation and sequencing (SMART-Seq)

Total RNA was prepared from cell lysates in 96-well plates using a QIAcube HT robotic workstation (Qiagen) in conjunction with RNeasy 96 QIAcube HT kits (Qiagen) according to the manufacturer’s recommended protocol. SMART-Seq DE3 libraries were prepared from total RNA according to the manufacturer’s protocol. Briefly, for each row of the plate, polyadenylated mRNA was selected using uniquely barcoded oligo(dT) primers. First strand cDNA was generated via reverse transcription; double-stranded cDNA was created via template switching with limited cycles of PCR amplification. Each row of samples was then pooled and subjected to transposon-based fragmentation using the Nextera XT DNA Library Preparation Kit (Illumina). Libraries were then PCR amplified using unique combinations of Illumina P5 and P7 barcodes and mixed in equimolar pools prior to sequencing. Sequencing was performed on an Illumina NovaSeq6000 using custom read lengths of 89 bp (Read1) and 26 bp (Read2).

Computational workflow

Cmax annotations

Cmax values were compiled from multiple resources, and the median of these values was used to derive a consensus total Cmax. The following resources contributed to computing the consensus Cmax: Drug information from the National Center for Advancing Translational Sciences (NCATS), Porceddu et al. (2012), Khetani et al. (2013), Persson et al. (2013), Aleo et al. (2014), Garside et al. (2014), Gustafsson et al. (2014), Chen et al. (2014), Shah et al. (2015), Camenisch et al. (2019), Dixit et al. (2019), Aleo et al. (2020), Williams et al. (2020), and Smit et al. (2020). For compounds where no studies provided reliable Cmax estimates or where estimates varied significantly across studies, additional manual annotations were performed by searching PubChem and relevant clinical studies.
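
As a minimal sketch of the consensus step described above, the median can be taken over whichever sources report a value for a given compound. The function name and the example values are hypothetical; handling of missing reports is an assumption, and the manual curation step is not modeled.

```python
from statistics import median


def consensus_cmax(reported_um):
    """Median of the available total Cmax reports (in uM); None if nothing usable."""
    # Assumption: missing or non-positive entries are simply ignored.
    values = [v for v in reported_um if v is not None and v > 0]
    return median(values) if values else None


# Hypothetical reports for one compound from four sources (uM); None = not reported.
reports = [12.0, 15.0, None, 90.0]
cmax = consensus_cmax(reports)
print(cmax)
```

The median keeps a single outlying report (90 μM in this made-up example) from dominating the consensus value.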

DILI categorization

DILIrank and LiverTox serve as key resources for categorizing drugs as DILI positive or negative. DILIrank, developed by the FDA, classifies over 1000 drugs into four levels of DILI concern — most, less, no concern, and undetermined — based on historical liver injury data. LiverTox, created by the NIDDK, is an exhaustive online database with detailed information on these drugs, including their clinical manifestations, mechanisms of action and likelihood of causing DILI. It also offers a system to assess the likelihood of DILI, from well-known cases to idiosyncratic drugs without a characteristic signature to drugs considered safe. These categorizations serve as DILI endpoints for the development and refinement of our model. Compounds were systematically categorized based on DILIrank and LiverTox into the following categories:

Withdrawn DILI: Market withdrawals and clinical failures due to DILI (Most-DILI-Concern; e.g., Troglitazone).

Known DILI: Compounds with well-established DILI risk (Most-DILI-Concern in DILIrank or LiverTox score A; e.g., Isoniazid at therapeutic doses, Acetaminophen at overdose levels).
Likely DILI: Compounds with documented cases of DILI in specific contexts (Most-DILI-Concern or LiverTox score A/B; e.g., Progesterone).
Idiosyncratic DILI: Rare, unpredictable DILI without clear dose-response (LiverTox score C/D; <12 case reports).
Unlikely DILI: Discordantly labeled safe in LiverTox (score E) but Less-DILI-concern in DILIrank.
No DILI: Compounds with no clinical or preclinical documented evidence of DILI (No-DILI-Concern in DILIrank; LiverTox score E).

The categories Withdrawn DILI, Known DILI, and Likely DILI serve as positive controls and No DILI as negative control. Our models were trained to differentiate between positive and negative control compounds. The categories Unlikely DILI and Idiosyncratic DILI, due to their label ambiguity, were excluded from model training and reserved for downstream testing.

In the training data, DILI positives include the highest concentrations and those above 20x Cmax, while DILI negatives include the lowest concentrations and those below 80x Cmax. This approach ensures unambiguous labeling while minimizing bias from overdose signatures in negatives and inactive signatures in positives. Using these categorizations, our model was trained to distinguish between 177 positive and 93 negative control data points. An additional set of 70 compounds known for their idiosyncratic effects, each linked to fewer than 12 case reports, was used exclusively for testing. Idiosyncratic hepatotoxicity, characterized by its unpredictability and lack of dose-response or temporal patterns, often leads to drug withdrawals despite thorough clinical testing. An independent in-house experiment was conducted for blind validation, comprising 46 compounds: 32 DILI positives and 14 negatives. This included well-established DILI positives/negatives and a real-world set with 4 recent clinical failures.
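
One possible reading of the dose-selection rule for training data points is sketched below. It assumes that, for a positive control, the highest tested dose and any dose above 20x Cmax are kept, and that, for a negative control, the lowest tested dose and any dose below 80x Cmax are kept. The function name and example values are hypothetical.

```python
def select_training_doses(doses_um, cmax_um, is_positive_control):
    """Doses of one control compound that are kept as training data points."""
    doses = sorted(doses_um)
    if is_positive_control:
        # Highest tested dose, plus any dose above 20x Cmax.
        return [d for d in doses if d == doses[-1] or d / cmax_um > 20]
    # Lowest tested dose, plus any dose below 80x Cmax.
    return [d for d in doses if d == doses[0] or d / cmax_um < 80]


doses = [1.0, 10.0, 100.0, 1000.0]  # uM, made up
# With Cmax = 2 uM the dose-to-Cmax ratios are 0.5, 5, 50 and 500.
pos = select_training_doses(doses, 2.0, is_positive_control=True)
neg = select_training_doses(doses, 2.0, is_positive_control=False)
print(pos, neg)
```

Under this reading, low doses of a positive control and overdose-level doses of a negative control are left out of training, which matches the stated goal of avoiding inactive signatures in positives and overdose signatures in negatives.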

Toxicogenomics resources and initial efforts

Human toxicogenomics benefits from resources such as TG-GATEs, CMap and L1000. CMap and L1000 catalogue genomic responses in in vitro cell lines, primarily cancer lines, and can be used to study cytotoxicity. However, their utility for DILI is limited: our initial efforts showed poor predictive performance, likely due to the lack of physiologically relevant cell types and of a relevant range of concentrations. In contrast, TG-GATEs provides a rich microarray database tailored for DILI research, including data from primary human hepatocytes (PHH) exposed to 158 compounds. Our initial proof-of-concept, utilizing TG-GATEs data, yielded meaningful predictions. In a rigorously designed in-house pilot experiment involving 46 compounds for blind validation, our model achieved 62% sensitivity in identifying DILI positives and 92% specificity for negative control compounds, demonstrating the potential of toxicogenomics in identifying DILI risks during drug development.

Quality control

We have created DILImap, our RNA-seq library encompassing 300 compounds, to improve the predictive power and mechanistic coverage of our ML model. We applied stringent quality control metrics and filtering criteria to retain only high-quality RNA-seq samples:

1. Total RNA counts > mean − 2.5 standard deviations (~700,000 counts).
2. Mitochondrial RNA fraction <9%.
3. Correlation between replicates >0.99.

In addition to these technical filters, we performed a hepatocyte fidelity check: each sample was assessed for expression of liver-specific marker genes and the preservation of a hepatocyte-like transcriptional profile.
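
The three technical filters can be sketched as a per-sample boolean mask. This is an illustration under stated assumptions: the per-sample replicate correlation is taken as a precomputed input, the function name and example values are hypothetical, and the hepatocyte fidelity check is not modeled.

```python
import numpy as np


def qc_pass(total_counts, mito_fraction, replicate_corr,
            n_sd=2.5, max_mito=0.09, min_corr=0.99):
    """Boolean mask of samples passing the three technical QC filters."""
    total_counts = np.asarray(total_counts, dtype=float)
    mito_fraction = np.asarray(mito_fraction, dtype=float)
    replicate_corr = np.asarray(replicate_corr, dtype=float)
    # Lower bound on library size: mean minus n_sd standard deviations.
    min_counts = total_counts.mean() - n_sd * total_counts.std()
    return ((total_counts > min_counts)
            & (mito_fraction < max_mito)
            & (replicate_corr > min_corr))


# Ten made-up samples: sample 0 fails the mitochondrial filter, sample 1 the
# replicate-correlation filter, and sample 9 the library-size filter.
counts = [2.0e6] * 9 + [1.0e5]
mito = [0.12] + [0.03] * 9
corr = [0.995, 0.95] + [0.995] * 8
mask = qc_pass(counts, mito, corr)
print(mask)
```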

Differential signatures

Samples passing all QC filters were used to compute compound-specific differential expression signatures. We used DESeq2, a standard RNA-seq analysis tool, which models read counts using a negative binomial distribution. DESeq2 adjusts for library size differences and estimates gene-specific dispersion, enabling accurate detection of differentially expressed genes (DEGs).

For each compound-dose combination, we compared treated samples against matched DMSO controls from the same plate. This plate-specific normalization controls for potential batch effects and ensures that the derived signatures reflect treatment-specific transcriptional responses.

Pathway signatures

Pathway enrichment analysis is a statistical approach used to identify whether specific biological pathways are significantly enriched with differentially expressed genes (DEGs) in RNA-seq data. The significance of this enrichment is assessed using p-values, which indicate the likelihood that the observed enrichment occurred by chance. In our analysis, we utilized the widely adopted hypergeometric test to compute p-values. This test calculates the probability of observing the number of DEGs in a given pathway, considering the total number of genes in the pathway and the overall number of DEGs in the dataset. To derive pathway signatures from p-values, we applied a -log10 transformation of the adjusted p-value, or False Discovery Rate (FDR). The FDR accounts for multiple testing, controlling the proportion of false positives in the analysis. The resulting -log10(FDR) scores serve as the input to our model, providing a robust representation of pathway enrichment.
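
The path from DEG counts to a −log10(FDR) pathway score can be sketched with the standard library alone. The sketch assumes an upper-tail hypergeometric test, P(X ≥ overlap), and Benjamini-Hochberg adjustment as the FDR procedure; the function names, the gene-universe size, and the example counts are hypothetical and may differ from the pipeline used in the study.

```python
from math import comb, log10


def hypergeom_pvalue(n_universe, n_pathway, n_degs, n_overlap):
    """P(X >= n_overlap) when n_degs genes are drawn without replacement."""
    total = comb(n_universe, n_degs)
    upper = min(n_pathway, n_degs)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_degs - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def pathway_scores(pvalues, floor=1e-300):
    """-log10(FDR) scores; `floor` avoids taking log10 of zero."""
    return [-log10(max(q, floor)) for q in benjamini_hochberg(pvalues)]


# Made-up setting: 1000 genes, 100 DEGs, three pathways of 50 genes each that
# contain 15, 6 and 3 DEGs (about 5 would be expected by chance).
pvals = [hypergeom_pvalue(1000, 50, 100, k) for k in (15, 6, 3)]
scores = pathway_scores(pvals)
print(scores)
```

Pathways with more DEGs than expected by chance receive larger scores, while pathways near the chance expectation stay close to zero.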

ToxPredictor classifier for cross-validation

To develop ToxPredictor, we evaluated a range of machine learning models, including Logistic Regression, Support Vector Classifier (SVC), Random Forest, Gradient Boosting Classifier, Hist Gradient Boosting Classifier, XGBoost Classifier, LGBM Classifier and Multi-Layer Perceptron (MLP). Each model was evaluated using a 5-fold stratified compound-level cross-validation strategy, ensuring that all samples from a single compound were held out together in each fold to prevent data leakage. This compound-level splitting reflects a realistic generalization scenario for novel compound prediction.

We optimized hyperparameters using grid search, assessing performance across the following metrics:

Area Under the ROC Curve (AUC): Measures the overall ability to distinguish DILI from No-DILI compounds.
Precision: Fraction of predicted DILI compounds that are truly DILI-positive, relevant for minimizing false positives.
Recall: Fraction of true DILI compounds that are correctly identified, reflecting the model’s ability to avoid false negatives.
Inter-fold correlation: Quantifies agreement in predicted probabilities between models trained on different splits.
Monotonic dose response: Assesses whether increasing doses of a compound correspond to increasing predicted DILI risk.

Model selection strategy

Generalizability: prioritize high validation AUC with low train–validation gap to avoid overfitting → Best models had AUC ≥ 0.74 (RF, HistGB, LGBM, XGB); RF showed the lowest gap (~0.09 vs. 0.16–0.20 for GBMs).
Stability: require high inter-fold correlation for consistent predictions across folds → RF achieved 0.98–1.0, higher than GBMs (0.8–0.9).
Clinical utility: balance precision (reduce false positives) and recall (reduce false negatives) → RF performance was comparable to other top models.
Interpretability: prefer simpler, easier-to-interpret models at similar performance → RF is more interpretable than boosting methods.

Final choice: Random Forest was selected as the most balanced and robust model, combining strong validation AUC, minimal overfitting, high stability, and interpretability. Although XGBoost and LightGBM reached similar AUCs, they showed greater overfitting (train-validation AUC gap) and lower stability (inter-fold correlation). These characteristics, along with its ability to handle non-linear relationships, its robustness to noise, and its suitability for datasets of moderate size, make Random Forest the most robust choice for generalization to unseen compounds.

The optimal hyperparameters were:

n_estimators = 100, max_depth = 2, min_samples_split = 2, min_samples_leaf = 1

This combination of performance and robustness established Random Forest as the base model for ToxPredictor.

ToxPredictor ensemble modeling for blind validation

To ensure predictive robustness, we employed a 30-model Random Forest ensemble using bootstrap aggregation (bagging). Each model was trained on a unique bootstrap sample from the training data, ensuring diversity in training instances and mitigating overfitting. We selected 30 models to strike a balance between accuracy and ensemble stability. In our benchmarks, ensembles with 30 independent learners consistently achieved smooth, reproducible predictions and stable dose-responses. All models used the hyperparameters optimized during cross-validation. At prediction time on unseen compounds, model outputs are averaged with equal weighting. The resulting consensus reduces the variance of individual models and yields stable classification of DILI vs. No-DILI. In blind validation, where predictions were made on compounds not seen during training or model selection, the ensemble demonstrated high AUC, dose consistency, and biological plausibility, confirming its utility for real-world DILI risk assessment.
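
A minimal sketch of such a bagged ensemble is shown below, using the hyperparameters reported for the base model. The bootstrap scheme (resampling individual data points with replacement), the function names, and the synthetic data are assumptions made for illustration; the study's own resampling unit is not specified here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_ensemble(X, y, n_models=30, seed=0):
    """Train one Random Forest per bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for m in range(n_models):
        # Bootstrap sample with replacement (assumed to contain both classes).
        idx = rng.integers(0, len(y), size=len(y))
        rf = RandomForestClassifier(n_estimators=100, max_depth=2,
                                    min_samples_split=2, min_samples_leaf=1,
                                    random_state=m)
        models.append(rf.fit(X[idx], y[idx]))
    return models


def predict_dili_probability(models, X):
    """Equal-weight average of the positive-class probability."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in for pathway-level features; only feature 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = (X[:, 0] > 0).astype(int)
models = fit_ensemble(X, y)
proba = predict_dili_probability(models, X)
```

The averaged output can then be compared against a decision threshold, for example `proba >= 0.7`, to obtain binary calls per data point.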

DILI risk probabilities

ToxPredictor outputs a predicted DILI probability, defined as the average prediction across the 30 Random Forest models in the ensemble. This probability reflects the model’s confidence that a compound induces liver injury, based on its pathway perturbation profile. To enable binary classification for downstream evaluation and decision-making, we defined a DILI risk threshold of 0.7. Compound-dose pairs with predicted probabilities ≥0.7 are considered DILI-positive, while those <0.7 are considered DILI-negative. This threshold was empirically selected to balance sensitivity and specificity in our benchmark datasets, ensuring robust identification of true DILI compounds while minimizing false positives.

DILI safety margins

ToxPredictor computes a Margin of Safety (MOS) for each compound, defined as the ratio between the first DILI dose and its maximum plasma concentration (Cmax) at therapeutic levels. The first DILI dose is the lowest dose at which a compound crosses the 0.7 DILI probability threshold, indicating transcriptional evidence of hepatotoxicity. We use total Cmax values rather than free (unbound) Cmax, as both approaches yield comparable predictive performance, but total Cmax offers broader availability across clinical data sources and aligns with common reporting in literature.

For validation and benchmarking, to classify a compound as DILI, we selected a Margin of Safety (MOS) threshold of 80, determined empirically from the training set. This value reflects an optimized trade-off between sensitivity and specificity—minimizing false positives while reaching a performance plateau that captures the majority of true DILI liabilities. Importantly, this threshold is consistent with values commonly used in the literature, where MOS cutoffs typically range from 10 to 100 depending on context, placing our choice within accepted toxicological standards.
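
The safety-margin calculation can be sketched as follows. The first DILI dose and the ratio to Cmax follow the definitions above; the direction of the final comparison (a compound is called DILI when its margin is at or below 80) is an assumption, as are the function names and example values.

```python
def first_dili_dose(doses_um, probabilities, threshold=0.7):
    """Lowest dose whose predicted DILI probability is >= threshold, else None."""
    flagged = [d for d, p in zip(doses_um, probabilities) if p >= threshold]
    return min(flagged) if flagged else None


def margin_of_safety(doses_um, probabilities, cmax_um, threshold=0.7):
    """Ratio of the first DILI dose to total Cmax; None if no dose is flagged."""
    dose = first_dili_dose(doses_um, probabilities, threshold)
    return None if dose is None else dose / cmax_um


def is_dili(mos, mos_cutoff=80.0):
    # Assumption: a compound is called DILI when its margin is at or below the cutoff.
    return mos is not None and mos <= mos_cutoff


doses = [1.0, 10.0, 100.0, 1000.0]   # uM, made up
probs = [0.10, 0.40, 0.75, 0.95]     # made-up ensemble probabilities
mos_narrow = margin_of_safety(doses, probs, cmax_um=2.0)
mos_wide = margin_of_safety(doses, probs, cmax_um=0.5)
print(mos_narrow, is_dili(mos_narrow), mos_wide, is_dili(mos_wide))
```

The same transcriptional response leads to different calls depending on exposure: a first DILI dose of 100 μM gives a margin of 50 at a Cmax of 2 μM but a margin of 200 at a Cmax of 0.5 μM.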

Prioritization of predictive features/pathways

To identify and prioritize features and pathways relevant to DILI risk prediction, feature importance within the Random Forest ensemble was assessed using statistical significance, impurity reduction, permutation importance, and direct discriminative power, ensuring a comprehensive and biologically meaningful selection of key predictors of DILI risk:

Statistical Significance: Features were prioritized if they demonstrated at least one significant differential occurrence across compounds, with a p-value threshold of 0.001.
Mean Decrease in Impurity (MDI): A standard metric for tree-based models that quantifies the reduction in impurity (e.g., Gini impurity) achieved by splits on specific features. The MDI values were averaged across all trees in the ensemble, providing a robust measure of a feature’s overall predictive contribution. Features with higher MDI scores were deemed more important for the model’s decision-making.
Permutation Importance: The values of each feature were shuffled individually, and the resulting change in model performance, specifically the Area Under the Curve (AUC), was measured. A greater reduction in AUC indicated a higher importance of the feature, offering additional insight into the features that most strongly influence the model’s predictions.
Discriminative Power (AUC): Each feature’s ability to distinguish between DILI and no-DILI categories was evaluated directly by calculating its individual AUC score, ensuring that features with strong discriminative capabilities were prioritized.
Spearman correlation: Correlation of pathway signatures with predicted DILI risk.
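
Three of the criteria above (MDI, permutation importance on AUC, and single-feature AUC) can be illustrated with scikit-learn on synthetic data. This sketch uses a single forest rather than the 30-model ensemble and scores importance on the training data for brevity; both choices, along with all variable names, are simplifying assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only feature 0 is informative in this made-up example.
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X, y)

# Mean Decrease in Impurity, averaged over the trees of the forest.
mdi = rf.feature_importances_

# Permutation importance: drop in AUC after shuffling one feature at a time.
perm = permutation_importance(rf, X, y, scoring="roc_auc",
                              n_repeats=10, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]

# Discriminative power of each feature on its own.
single_auc = [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]
print(ranking, np.round(mdi, 3), np.round(single_auc, 3))
```

In practice, permutation importance is more informative when computed on held-out data, since training-set scores can overstate the contribution of features the model has overfit.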

Empirical DILI likelihoods for dose recommendations

Empirical cumulative DILI likelihoods are calculated to inform dose recommendations that minimize the risk of DILI. For each compound, safety margins are computed at varying hypothetical Cmax values as the ratio of the model-predicted first DILI dose to the corresponding Cmax. For each safety margin, two cumulative percentages are derived: (1) the percentage of DILI compounds with safety margins above the given value, which increases monotonically from 0 to 1 as Cmax increases, and (2) the percentage of non-DILI compounds with safety margins below the same value, which decreases monotonically from 1 to 0. The empirical cumulative DILI likelihood at each safety margin is calculated as the difference between these two percentages, effectively representing the relative enrichment of DILI compounds compared to non-DILI compounds at or above a given margin. This approach captures the overall relationship between safety margins and DILI likelihood across the dataset, rather than focusing on isolated points, thereby enabling robust differentiation between high-risk and low-risk compounds. The resulting cumulative risk profiles provide a quantitative framework to guide dose selection.

Benchmarking against structure-based in-silico methods

We benchmarked ToxPredictor against three state-of-the-art structure-based models: DILIGeNN, a graph neural network trained on molecular graphs; DILIPredictor, a random forest ensemble model that integrates chemical structure, physicochemical properties, pharmacokinetic parameters and predicted proxy-DILI data; and TxGemma, a generalist large language model (LLM) fine-tuned on biomedical tasks from the Therapeutic Data Commons (TDC). The evaluation aimed to assess the predictive performance of each approach on overlapping and unseen compounds using balanced accuracy, sensitivity, and specificity.

DILIPredictor: Random Forest ensemble integrating proxy-DILI labels and chemical structure

A Random Forest model trained on nine proxy-DILI labels (e.g., mitochondrial toxicity, BSEP inhibition) in conjunction with chemical structural features derived from SMILES strings of 1111 DILI compounds. We reproduced the model from its public repository, converting the Poetry environment to Conda (via poetry2conda) and using scikit-learn v1.2.0 to match the pretrained model version. As compound names were not provided, we standardized SMILES and generated InChIKeys (non-isotopic, non-stereochemical layer) to identify 483 unseen compounds; 6 failed at inference and 6 more were deduplicated, resulting in 471 (98 DILI+, 373 DILI−). This approach may still introduce limited data leakage if SMILES differ from those used during training. For benchmarking against other in-silico chemistry models, we focused on 314 compounds identified as unseen by both DILIGeNN and DILIPredictor, to provide a more conservative evaluation set that helps reduce the risk of data leakage. This is particularly important for DILIPredictor, which lacked compound identifiers and used a final model re-trained on the test set. For benchmarking against ToxPredictor, we took the overlap of the 471 compounds with DILImap, yielding 30 compounds (23 DILI+, 7 DILI−).
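
Matching compounds on the non-isotopic, non-stereochemical part of an InChIKey can be sketched as a comparison of the first 14-character block, which hashes the connectivity layer. Treating that block as the matching key is an assumption about how the comparison was done; the function names are hypothetical, and the keys below are placeholders in InChIKey format rather than real identifiers.

```python
def connectivity_block(inchikey):
    """First 14 characters of an InChIKey: the hash of the connectivity layer."""
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14:
        raise ValueError(f"unexpected InChIKey format: {inchikey!r}")
    return block


def unseen_compounds(candidate_keys, training_keys):
    """Candidates whose connectivity block does not occur in the training set."""
    seen = {connectivity_block(k) for k in training_keys}
    return [k for k in candidate_keys if connectivity_block(k) not in seen]


# Placeholder keys (not real compounds). The first candidate differs from the
# training key only in its second block, so it counts as already seen.
training = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N"]
candidates = ["AAAAAAAAAAAAAA-BBBBBBBBSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"]
unseen = unseen_compounds(candidates, training)
print(unseen)
```

Matching on the first block is deliberately conservative: stereoisomers and isotopologues of a training compound are treated as seen, which reduces the risk of leakage at the cost of a smaller evaluation set.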

DILIGeNN: Graph Neural Network on Molecular Graphs

A graph neural network (GNN) trained on molecular graphs derived from SMILES strings of 1167 DILI compounds. We benchmarked the best-performing GraphSAGE models obtained after sequential warm starts, and used the recommended custom molecular graph representations of SMILES for each compound. To identify unseen compounds, we cross-referenced compound names, SMILES and InChIKeys (non-isotopic, non-stereochemical layer), and identified 349 unseen compounds; 18 failed inference, yielding a final evaluation set of 331 compounds (47 DILI+, 284 DILI−). Of these, 8 (5 DILI+, 3 DILI−) overlapped with DILImap and were used for benchmarking. Final predictions were based on the mean probability and majority vote across four GraphSAGE models. In addition, we benchmarked DILIGeNN against DILIPredictor on 314 shared unseen compounds (45 DILI+, 269 DILI−) to evaluate real-world performance of in silico models across a broader evaluation set.

TxGemma: Generalist Language Models for Biomedical Tasks

A suite of generalist large language models (LLMs), fine-tuned from Gemma-2 (2B, 9B, 27B parameters) on biomedical tasks from the Therapeutic Data Commons (TDC). For DILI classification, models were trained on 475 compounds (325 train, 54 validation, 96 test). To limit data leakage, we used SMILES and InChIKeys to select 715 unseen compounds (178 DILI+, 537 DILI−); however, leakage could not be fully ruled out. Of these, 97 compounds (69 DILI+, 35 DILI−) overlapped with DILImap and were used for benchmarking. TxGemma models were deployed on Google Cloud Vertex AI using the official cookbook (https://github.com/google-gemini/gemma-cookbook/tree/main/TxGemma) with the recommended prompts. The 27B model achieved the best performance and was used for comparison with ToxPredictor; AUROC was not computed because the model outputs only DILI labels rather than prediction probabilities. Performance improvements may be possible with additional fine-tuning and prompt engineering.
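
When a model returns only hard labels, threshold-dependent metrics such as sensitivity and specificity can still be computed, whereas AUROC requires a continuous score to rank compounds. The following is a minimal sketch of that calculation; the function name and the toy labels are hypothetical and do not reproduce any result from this study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity and specificity from binary labels (1 = DILI+)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative compound")
    sensitivity = tp / (tp + fn)   # fraction of DILI+ compounds flagged
    specificity = tn / (tn + fp)   # fraction of DILI- compounds cleared
    return sensitivity, specificity


# Toy example: 4 DILI+ and 3 DILI- compounds with label-only predictions.
truth = [1, 1, 1, 1, 0, 0, 0]
calls = [1, 1, 1, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, calls)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A label-only model therefore contributes a single point in ROC space rather than a curve, which is why it is compared on sensitivity and specificity alone.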

Data analysis software used in this study

The study used NumPy, pandas, and SciPy for computation, AnnData for single-cell data handling, scikit-learn v1.4.0 for machine learning, and matplotlib/seaborn for visualization. Dask enabled scalable computing and quilt3 managed dataset access. DESeq2 supported differential expression analysis and gseapy supported pathway enrichment. XGBoost and LightGBM were used for model comparison to select the optimal base model for ToxPredictor. Benchmarking against in silico methods employed PyTorch Geometric for graph modeling, Captum for interpretability, and RDKit for cheminformatics.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets and trained models are accessible via an S3 bucket for seamless integration with Jupyter notebooks, with detailed descriptions and access instructions provided at https://www.dilimap.org. In addition, the data have been deposited in GEO under a Creative Commons license (GSE308567). Processed training data (pathway-level signatures used as model input) and both raw and processed validation data are provided to enable full reproducibility of all model training and validation steps. The raw training data (gene expression count matrices) contain proprietary information and are not publicly available. Academic researchers may request access for internal, non-commercial use via DILImap@cellarity.com, with requests reviewed within 4–8 weeks. Approved data are available for 2 weeks and must be deleted within 6 months. Commercial access requires a data-sharing agreement. The data from the Open TG-GATEs database (http://dbarchive.biosciencedbc.jp/en/open-tggates/download.html)¹⁵ were used in an initial proof-of-concept to establish a toxicogenomics baseline performance, whereas all results reported in this manuscript are based on the internally generated data described above. The use of all datasets in this study complies with the terms and conditions of their respective repositories and data providers.

Code availability

All code, reproducibility notebooks, and results are available at https://www.dilimap.org, which serves as the central access point for this work. The full Python implementation is hosted at https://www.github.com/Cellarity/DILImap, with reproducibility notebooks and results at https://www.github.com/Cellarity/DILImap_reproducibility. Both repositories are also archived at https://zenodo.org/records/17290520⁶¹.

References

A New Standard. http://tools.thermofisher.com/content/sfs/brochures/D01834~.pdf.
Olson, H. et al. Concordance of the toxicity of pharmaceuticals in humans and in animals. Regul. Toxicol. Pharmacol. 32, 56–67 (2000).
Onakpoya, I. J., Heneghan, C. J. & Aronson, J. K. Post-marketing withdrawal of 462 medicinal products because of adverse drug reactions: a systematic review of the world literature. BMC Med. 14, 10 (2016).
Funk, C. & Roth, A. Current limitations and future opportunities for prediction of DILI from in vitro. Arch. Toxicol. 91, 131–142 (2017).
Robles-Díaz, M., Medina-Caliz, I., Stephens, C., Andrade, R. J. & Lucena, M. I. Biomarkers in DILI: One more step forward. Front. Pharmacol. 7, 267 (2016).
Chalasani, N. & Björnsson, E. Risk factors for idiosyncratic drug-induced liver injury. Gastroenterology 138, 2246–2259 (2010).
Mosedale, M. & Watkins, P. B. Drug-induced liver injury: Advances in mechanistic understanding that will inform risk management. Clin. Pharmacol. Ther. 101, 469–480 (2017).
Allison, R. et al. Drug induced liver injury - a 2023 update. J. Toxicol. Environ. Health B Crit. Rev. 26, 442–467 (2023).
Weaver, R. J. et al. Managing the challenge of drug-induced liver injury: a roadmap for the development and deployment of preclinical predictive models. Nat. Rev. Drug Discov. 19, 131–148 (2020).
Chen, M. et al. Quantitative structure-activity relationship models for predicting drug-induced liver injury based on FDA-approved drug labeling annotation and using a large collection of drugs. Toxicol. Sci. 136, 242–249 (2013).
Kim, E. & Nam, H. Prediction models for drug-induced hepatotoxicity by using weighted molecular fingerprints. BMC Bioinforma. 18, 227 (2017).
Yang, S., Ooka, M., Margolis, R. J. & Xia, M. Liver three-dimensional cellular models for high-throughput chemical testing. Cell Rep. Methods 3, 100432 (2023).
Chen, M. et al. DILIrank: the largest reference drug list ranked by the risk for developing drug-induced liver injury in humans. Drug Discov. Today 21, 648–653 (2016).
Serrano, J. LiverTox: An online information resource and a site for case report submission on drug-induced liver injury. Clin. Liver Dis. (Hoboken) 4, 22–25 (2014).
Igarashi, Y. et al. Open TG-GATEs: a large-scale toxicogenomics database. Nucleic Acids Res. 43, D921–D927 (2015).
O’Brien, P. J. et al. High concordance of drug-induced human hepatotoxicity with in vitro cytotoxicity measured in a novel cell-based model using high content screening. Arch. Toxicol. 80, 580–604 (2006).
Xu, J. J. et al. Cellular imaging predictions of clinical drug-induced liver injury. Toxicol. Sci. 105, 97–105 (2008).
Tolosa, L. et al. Development of a multiparametric cell-based protocol to screen and classify the hepatotoxicity potential of drugs. Toxicol. Sci. 127, 187–198 (2012).
Persson, M., Løye, A. F., Mow, T. & Hornberg, J. J. A high content screening assay to predict human drug-induced liver injury during drug discovery. J. Pharmacol. Toxicol. Methods 68, 302–313 (2013).
Garside, H. et al. Evaluation of the use of imaging parameters for the detection of compound-induced hepatotoxicity in 384-well cultures of HepG2 cells and cryopreserved primary human hepatocytes. Toxicol. Vitr. 28, 171–181 (2014).
Schadt, S. et al. Minimizing DILI risk in drug discovery — A screening tool for drug candidates. Toxicol. Vitr. 30, 429–437 (2015).
Proctor, W. R. et al. Utility of spherical human liver microtissues for prediction of clinical drug-induced liver injury. Arch. Toxicol. 91, 2849–2863 (2017).
Khetani, S. R. et al. Use of micropatterned cocultures to detect compounds that cause drug-induced liver injury in humans. Toxicol. Sci. 132, 107–117 (2013).
Porceddu, M. et al. Prediction of liver injury induced by chemicals in human with a multiparametric assay on isolated mouse liver mitochondria. Toxicol. Sci. 129, 332–345 (2012).
Gustafsson, F., Foster, A. J., Sarda, S., Bridgland-Taylor, M. H. & Kenna, J. G. A correlation between the in vitro drug toxicity of drugs to cell lines that express human P450s and their propensity to cause liver injury in humans. Toxicol. Sci. 137, 189–211 (2014).
Vorrink, S. U., Zhou, Y., Ingelman-Sundberg, M. & Lauschke, V. M. Prediction of drug-induced hepatotoxicity using long-term stable primary hepatic 3D spheroid cultures in chemically defined conditions. Toxicol. Sci. 163, 655–665 (2018).
Albrecht, W. et al. Prediction of human drug-induced liver injury (DILI) in relation to oral doses and blood concentrations. Arch. Toxicol. 93, 1609–1637 (2019).
Aleo, M. D. et al. Moving beyond binary predictions of human drug-induced liver injury (DILI) toward contrasting relative risk potential. Chem. Res. Toxicol. 33, 223–238 (2020).
Sakatis, M. Z. et al. Preclinical strategy to reduce clinical hepatotoxicity using in vitro bioactivation data for >200 compounds. Chem. Res. Toxicol. 25, 2067–2082 (2012).
Dawson, S., Stahl, S., Paul, N., Barber, J. & Kenna, J. G. In vitro inhibition of the bile salt export pump correlates with risk of cholestatic drug-induced liver injury in humans. Drug Metab. Dispos. 40, 130–138 (2012).
Wang, E. et al. TxGemma: Efficient and Agentic LLMs for Therapeutics. arXiv [cs.AI] (2025).
Lee, T. & Posma, J. Improving drug-induced liver injury prediction using graph neural networks with augmented graph features from molecular optimisation. ChemRxiv https://doi.org/10.26434/chemrxiv-2024-d12gk-v2 (2025).
Seal, S. et al. Improved detection of drug-induced liver injury by integrating predicted in vivo and in vitro data. Chem. Res. Toxicol. 37, 1290–1305 (2024).
Walker, P. A., Ryder, S., Lavado, A., Dilworth, C. & Riley, R. J. The evolution of strategies to minimise the risk of human drug-induced liver injury (DILI) in drug discovery and development. Arch. Toxicol. 94, 2559–2585 (2020).
Ewart, L. et al. Performance assessment and economic analysis of a human Liver-Chip for predictive toxicology. Commun. Med. (Lond.) 2, 154 (2022).
Fäs, L. et al. Physiological liver microtissue 384-well microplate system for preclinical hepatotoxicity assessment of therapeutic small molecule drugs. Toxicol. Sci. 203, 79–87 (2025).
Wilkening, S., Stahl, F. & Bader, A. Comparison of primary human hepatocytes and hepatoma cell line Hepg2 with regard to their biotransformation properties. Drug Metab. Dispos. 31, 1035–1042 (2003).
Olsavsky, K. M. et al. Gene expression profiling and differentiation assessment in primary human hepatocyte cultures, established hepatoma cell lines, and human liver tissues. Toxicol. Appl. Pharmacol. 222, 42–56 (2007).
Grinberg, M. et al. Toxicogenomics directory of chemically exposed human hepatocytes. Arch. Toxicol. 88, 2261–2287 (2014).
Kiamehr, M. et al. Dedifferentiation of primary hepatocytes is accompanied with reorganization of lipid metabolism indicated by altered molecular lipid and miRNA profiles. Int. J. Mol. Sci. 20, 2910 (2019).
Zdrazil, B. et al. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).
Shah, F. et al. Setting clinical exposure levels of concern for drug-induced liver injury (DILI) using mechanistic in vitro assays. Toxicol. Sci. 147, 500–514 (2015).
Williams, D. P., Lazic, S. E., Foster, A. J., Semenova, E. & Morgan, P. Predicting drug-induced liver injury with Bayesian machine learning. Chem. Res. Toxicol. 33, 239–248 (2020).
Kelder, T. et al. WikiPathways: building research communities on biological pathways. Nucleic Acids Res. 40, D1301–D1307 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Rao, M. S. et al. Comparison of RNA-Seq and microarray gene expression platforms for the toxicogenomic evaluation of liver from short-term rat toxicity studies. Front. Genet. 9, 636 (2018).
Allard, J. et al. Drug-induced hepatic steatosis in absence of severe mitochondrial dysfunction in HepaRG cells: proof of multiple mechanism-based toxicity. Cell Biol. Toxicol. 37, 151–175 (2021).
Bethesda (MD): National Institute of Diabetes and Digestive and Kidney Diseases. in LiverTox: Clinical and Research Information on Drug-Induced Liver Injury (2012).
Hliwa, A., Ramos-Molina, B., Laski, D., Mika, A. & Sledzinski, T. The role of fatty acids in non-alcoholic fatty liver disease progression: An update. Int. J. Mol. Sci. 22, 6900 (2021).
Osborne, T. F. & Espenshade, P. J. Lipid balance must be just right to prevent development of severe liver damage. J. Clin. Investig. 132, 11 (2022).
Shang, H. et al. Gut microbiota-derived tryptophan metabolites alleviate liver injury via AhR/Nrf2 activation in pyrrolizidine alkaloids-induced sinusoidal obstruction syndrome. Cell Biosci. 13, 1 (2023).
Zhu, L. et al. The emerging role of ferroptosis in various chronic liver diseases: Opportunity or challenge. J. Inflamm. Res. 16, 381–389 (2023).
Chen, S.-S. et al. Serum metabolomic analysis of chronic drug-induced liver injury with or without cirrhosis. Front. Med. (Lausanne) 8, 640799 (2021).
Wagner, M., Zollner, G. & Trauner, M. Nuclear receptors in liver disease. Hepatology 53, 1023–1034 (2011).
da Silva, R. P., Eudy, B. J. & Deminice, R. One-carbon metabolism in fatty liver disease and fibrosis: One-carbon to rule them all. J. Nutr. 150, 994–1003 (2020).
Cai, S.-Y. & Boyer, J. L. The role of bile acids in cholestatic liver injury. Ann. Transl. Med. 9, 737 (2021).
Ray, W. A., Griffin, M. R. & Stein, C. M. Cardiovascular toxicity of valdecoxib. N. Engl. J. Med. 351, 2767 (2004).
Shrier, M., Díaz, J. E. & Tsarouhas, N. Cardiotoxicity associated with bupropion overdose. Ann. Emerg. Med. 35, 100 (2000).
Levine, M., Pizon, A. F., Padilla-Jones, A. & Ruha, A.-M. Warfarin overdose: a 25-year experience. J. Med. Toxicol. 10, 156–164 (2014).
Kohonen, P. et al. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nat. Commun. 8, 15932 (2017).
Bergen, V., Srikrishnan, S. & Cellarity Inc. Cellarity/DILImap. (Zenodo, 2025). https://doi.org/10.5281/ZENODO.17290520.

Acknowledgements

We sincerely thank Bill Pennie and Atli Thorarensen for bringing in pivotal ideas that shaped the work. We appreciate Govinda Bhisetti for the thorough review of the manuscript. Our gratitude extends to Robb Nicewonger for his work on the chemistry SOP, Cameron Reilly for his support with RNA extraction and automation, Laura Isacco for coordinating data generation timelines, Thao Tran and Wynter Guess for their assistance with compound screening, and Winnie Lee and Brian Yi for their work on RNA extraction, all of which contributed to the data generation foundational to this work. Finally, we thank the reviewers for their thoughtful and constructive requests, which helped refine and strengthen the work.

Author information

Authors and Affiliations

Cellarity Inc., Somerville, MA, USA
Volker Bergen, Konstantia Kodella, Sreenath Srikrishnan, Ornella Barrandon, Sara Anderson, Max Rogers-Grazado, Casey Fowler, Hirit Beyene, Nicole Robichaud, Timothy Fulton, Nina Lapchyk, Mauricio Cortes, Nick Plugis, Matthew Goddeeris & Mahdi Zamanighomi

Contributions

V.B. and K.K. conceived and co-led the project. V.B. developed the framework and authored the manuscript. K.K. directed the design and execution of the cell culture experiments. S.S. implemented and ran benchmarking against state-of-the-art in silico models. O.B. contributed to the study’s conceptual framework. S.A. contributed to the development of the culture system and conducted viability screens. M.R. established the initial experiments and contributed to RNA-seq method selection. C.F. implemented automation for data generation workflows. H.B. led compound management and screening and contributed to the chemistry SOP. N.R. performed library construction for sequencing. T.F. contributed to compound management and screening. N.L. established the Dotmatics workflows. M.C. supervised the data generation activities. N.P. was involved in the original conception of using transcriptomics to model off-target effects. M.G. provided strategic direction. M.Z. provided oversight for the machine learning aspects and strategic direction.

Corresponding authors

Correspondence to Volker Bergen or Mahdi Zamanighomi.

Ethics declarations

Competing interests

All authors were employees of Cellarity Inc. at the time the study was performed and hold equity in the company. The research was funded by Cellarity. V.B., K.K., and O.B. are inventors on patent application WO2025/024525, related to methods for DILI prediction, assigned to Cellarity.

Peer review

Peer review information

Nature Communications thanks Volker Lauschke, James Dear and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Peer Review File (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Data 1 (XLSX)

Supplementary Data 2 (XLSX)

Supplementary Data 3 (XLSX)

Supplementary Data 4 (XLSX)

Reporting Summary (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Bergen, V., Kodella, K., Srikrishnan, S. et al. A large-scale human toxicogenomics resource for drug-induced liver injury prediction. Nat Commun 16, 9860 (2025). https://doi.org/10.1038/s41467-025-65690-3

Received: 08 January 2025
Accepted: 21 October 2025
Published: 13 November 2025
Version of record: 13 November 2025
DOI: https://doi.org/10.1038/s41467-025-65690-3