Introduction

Single-cell RNA sequencing (scRNA-seq) has emerged as a central technique in biomedical research1,2,3,4, enabling high-resolution profiling of cellular heterogeneity across tissues, disease states, and therapeutic interventions5,6. Its adoption continues to grow across the pharmaceutical landscape7,8, where it supports applications in target discovery, biomarker stratification, and mechanism-of-action studies9,10,11,12,13. In parallel, public repositories have accumulated an unprecedented volume and diversity of scRNA-seq datasets, spanning multiple species, tissues, and experimental conditions14,15,16. This rapid expansion presents a unique opportunity to transform fragmented single-cell datasets into structured, decision-support workflows that enhance biological interpretation and accelerate translational research17,18.

Despite the availability of these datasets, integrating published studies into institutional and enterprise-level analysis workflows remains a slow and resource-intensive process (Fig. 1a). Data ingestion typically begins with manual interpretation of the source publication to establish biological context, followed by the extraction of key metadata—such as tissue type, species, disease condition, and accession identifiers—and culminates in the manual retrieval of datasets from corresponding public repositories. The process then continues with dataset processing and analysis, which often require adaptation to institutional conventions and custom scripting practices. These interpretation- and scripting-based tasks are often delegated to specialists with both domain knowledge and computational proficiency, resulting in high personnel costs, increased susceptibility to analyst-driven errors, and inconsistent throughput due to user-dependent variability in single-cell data analysis. Given the repetitive and procedural nature of these tasks, shifting them left—from specialist bioinformaticians to bench scientists—would not only enable faster, more consistent execution of data workflows and reduce dependence on bespoke computational support, but also free up expert capacity for more exploratory or high-impact scientific efforts.

Fig. 1: Overview of the CellAtria agentic framework.
figure 1

a Manual data onboarding cycle for public single-cell datasets. This panel illustrates the conventional workflow followed by investigators when integrating newly published datasets into internal research environments. The process spans from dataset announcement and assignment through metadata extraction, data validation, pipeline configuration, and analysis execution. b Architecture of agentic triage and execution. CellAtria employs a large language model (LLM) to mediate task dispatch across a graph-based, multi-actor execution framework, which is integrated with the co-developed CellExpress pipeline. High-level actions such as document parsing, metadata structuring, dataset retrieval, and file organization are coordinated through the LLM-mediated interface and mapped to appropriate backend tools for execution, ultimately producing analysis-ready outputs.

Agentic artificial intelligence (AI) systems19,20 offer a structured approach to automating complex biomedical workflows by coupling large language models (LLMs)21,22 with domain-specific computational toolchains23. These systems operate through modular, executable architectures in which predefined computational functions are dynamically composed in response to user prompts and contextual cues. The LLM serves as a semantic interface layer, interpreting natural language prompts and dispatching appropriate computational actions, thereby enabling dialogue-driven interaction with underlying infrastructure. Consequently, agentic AI systems, through their adaptive control flow, autonomous decision-making, and operational self-sufficiency, offer a scalable framework for minimizing manual intervention and democratizing access to bioinformatics infrastructures.
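The dispatch pattern described above can be sketched in a few lines of Python; the registry contents, tool names, and return values here are hypothetical illustrations of the paradigm, not CellAtria’s actual interface.

```python
from typing import Callable, Dict

# Registry of pre-vetted tools the agent may invoke (names are illustrative).
TOOLS: Dict[str, Callable[..., str]] = {
    "parse_article": lambda url: f"metadata extracted from {url}",
    "retrieve_dataset": lambda accession: f"downloaded {accession}",
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Execute a named tool only if it exists in the vetted registry."""
    if tool_name not in TOOLS:
        # Unknown (e.g., hallucinated) actions are rejected, never improvised.
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**kwargs)
```

In a real agentic loop, the LLM would emit the tool name and arguments, while the dispatcher guarantees that only pre-defined computational functions are ever executed.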

While some contemporary perspectives propose leveraging LLMs for dynamic, ad-hoc analytical code generation at runtime24,25,26,27, such strategies encounter obstacles in practice. The primary challenge with direct code generation for complex single-cell bioinformatics tasks lies in ensuring consistent output and scientific validity28. Such methods inherently struggle with reproducibility, as LLM outputs can vary based on model versions and prompting nuances29. More critically, the generated code carries a substantial risk of hallucinations30,31, logical errors32,33, incompatibility with state-of-the-art methods or software (due to the LLM’s training data temporal cutoff)34,35, and the misapplication of domain-specific parameters36. Any of these issues can lead to scientifically unsound analyses without extensive human oversight37 and debugging38. Most importantly, in clinical or highly regulated biomedical settings, the lack of protocol standardization inherent in ad-hoc code generation risks violating established quality control and analysis guidelines39. More recent developments, such as scExtract40 and SCassist41, have explored LLM-assisted pipelines. However, both rely on scripted or command-line execution and do not provide an agentic planner or a conversational interface to support interactive, multi-turn orchestration.

Moving beyond the risks of the ad-hoc code-generation strategy, we adopted a critical architectural decision: an LLM-mediated, tool-centric paradigm that balances flexibility with the imperative for scientific rigor and reproducibility. Here, the LLM’s role shifts from writing raw code to intelligently orchestrating a library of pre-vetted, robust analytical tools. We developed CellAtria, an agentic AI system that enables end-to-end, document-to-analysis automation in single-cell research (Fig. 1b). By combining natural language interaction with a graph-based, multi-actor execution framework (Fig. 2a), CellAtria links tasks ranging from literature parsing, metadata extraction, and dataset retrieval to scRNA-seq processing via its co-developed companion pipeline, CellExpress, which applies state-of-the-art processing steps to transform raw count matrices into analysis-ready single-cell profiles (Fig. 2b). Through the CellAtria user interface (Supplementary Fig. S1), researchers interact with a language model that orchestrates pre-validated analytical tools (Supplementary Fig. S2), eliminating the need for manual scripting while ensuring standardized, reproducible analyses and accelerating the reuse of public single-cell resources.

Fig. 2: From LLM‑mediated orchestration to automated single‑cell analysis.
figure 2

a Language model-mediated orchestration of toolchains. Upon receiving a user prompt, the CellAtria interface transfers the request to the LLM agent, which interprets the user’s intent and autonomously invokes relevant tools. Outputs are then returned through the interface, completing a full cycle of context-aware execution. b Structure of the CellExpress pipeline. CellExpress implements a standardized single-cell RNA-seq workflow, encompassing project setup, quality control (QC), normalization, batch correction, dimensionality reduction, clustering, and cell type annotation. This pipeline is fully customizable and seamlessly integrated with the CellAtria orchestration layer, enabling automated execution from raw count matrices to interpretable results. It produces a comprehensive analytical package, including: (1) An interactive HTML report summarizing the workflow with key QC metrics and visualizations; (2) A finalized, fully annotated AnnData object for downstream analyses; (3) A machine-readable configuration export for full auditability; and (4) A QC-filtered AnnData object for alternative usages.
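Item (3) of the output package, the machine-readable configuration export, might look roughly like the following sketch; all keys and default values here are assumptions chosen for illustration, not the actual CellExpress schema.

```python
import json

# Hypothetical pipeline configuration mirroring the stages in the caption:
# QC, normalization, batch correction, embedding, clustering.
config = {
    "qc": {"min_genes": 200, "max_pct_mito": 20.0, "doublet_detection": True},
    "normalization": {"method": "log1p", "target_sum": 1e4},
    "batch_correction": {"enabled": True, "batch_key": "sample_id"},
    "embedding": ["umap", "tsne"],
    "clustering": {"method": "leiden", "resolution": 1.0},
}

# Serializing with sorted keys yields a stable, diff-friendly artifact,
# supporting the auditability goal described in the caption.
config_json = json.dumps(config, indent=2, sort_keys=True)
```

Exporting the exact parameterization alongside the results is what makes a run reproducible: the same configuration file can be replayed to regenerate the analysis.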

Results

A comprehensive description of the agentic system, CellAtria, including its architectural design and implementation details, is provided in the Methods section. To illustrate CellAtria’s capabilities, we implemented three prototypical use cases, each representing a common scenario in single-cell research.

CellAtria extracts metadata from article URLs and retrieves study-level datasets

In translational and early discovery research, investigators often encounter newly published studies whose associated datasets are highly relevant to institutional priorities.

To demonstrate CellAtria’s capability for literature-driven data acquisition and analysis execution, we selected a publicly available longitudinal transcriptomic study profiling immune responses in 2-month-old infants following routine vaccination42. Upon receiving the article URL, the agent conducts a multi-turn dialogue to parse the manuscript directly from the journal webpage, extract key structured metadata—including sample annotations and accession identifiers—and coordinate dataset retrieval from corresponding public repositories using GSE-level (GEO study-wide) accessioning (Supplementary Figs. S3–S10). Following user validation and direction, the agent proceeds to triage these data, executing essential organizational tasks and ensuring strict naming compatibility for seamless integration with downstream analytical functions (Supplementary Figs. S10, S11).
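The accession-harvesting step in this kind of workflow can be illustrated with a minimal sketch; the regular expressions below cover the standard GEO identifier shapes (GSE for series, GSM for samples) but are not CellAtria’s actual parser.

```python
import re

def extract_geo_accessions(text: str) -> dict:
    """Harvest GEO series (GSE) and sample (GSM) identifiers from free text."""
    return {
        "series": sorted(set(re.findall(r"\bGSE\d+\b", text))),
        "samples": sorted(set(re.findall(r"\bGSM\d+\b", text))),
    }
```

Deduplication and sorting make the output deterministic, so repeated parses of the same document yield identical accession lists for downstream retrieval.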

We note that this multi-turn interaction established a dynamic feedback loop, wherein the agent proactively surfaced context-aware downstream options, enabling real-time user validation and iterative adaptation (anticipating needs and suggesting actions) aligned with the agent’s evolving strategic objective. This form of interactive, goal-conditioned reasoning exemplifies core agentic capabilities, which surpass the constraints of traditional rule-based approaches (e.g., static dashboard systems) by effectively handling uncertainty, enabling workflow reconfiguration on demand, and supporting user-driven exploration in a dialogic manner.

Thus, CellAtria effectively bridges literature discovery and structured dataset acquisition, establishing a foundation for automated, goal-directed workflows in single-cell research.

CellAtria parses scientific PDFs and retrieves sample-level datasets

Direct access to structured content on publisher websites can be restricted by technical barriers (e.g., dynamic page rendering, authentication walls) or constrained by licensing terms that limit programmatic scraping—even when institutional access rights are granted.

To address this limitation, the second scenario demonstrates metadata extraction from a locally stored PDF file in place of an article URL. CellAtria mitigates these access constraints by supporting direct parsing of PDF documents, enabling metadata retrieval in settings where web-based extraction is infeasible. We sought to demonstrate this capability by supplying CellAtria with an offline copy of a published study profiling T cell states across tumor, lymph node, and normal tissues in non-small cell lung cancer patients undergoing immune checkpoint blockade43. Upon uploading the PDF file, the agent engages in a multi-turn dialogue to extract structured metadata using a built-in document parser, enabling dataset retrieval even when journal webpage scraping is infeasible (Supplementary Figs. S12, S13). In contrast to the first scenario, where datasets are retrieved at the study level, data acquisition here is carried out at the GEO sample level (GSM) using a task-specific tool, enabling fine-grained retrieval. The agent carries out each stage in response to user instructions conveyed through conversational natural language (Supplementary Figs. S14, S15).
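Sample-level retrieval ultimately resolves GSM identifiers to repository locations. A hedged sketch of that resolution, following the publicly documented NCBI GEO FTP directory layout, is shown below; the helper name is ours, and any generated URL should be verified against the live repository before use.

```python
def geo_suppl_url(accession: str) -> str:
    """Build the supplementary-file directory URL for a GEO accession.

    Assumes a standard accession with at least three digits (e.g., GSE123456,
    GSM4058900). GEO groups records into directories where the last three
    digits are masked with 'nnn'.
    """
    kind = {"GSE": "series", "GSM": "samples"}[accession[:3]]
    stem = accession[:-3] + "nnn"  # e.g., GSE123456 -> GSE123nnn
    return f"https://ftp.ncbi.nlm.nih.gov/geo/{kind}/{stem}/{accession}/suppl/"
```

Working at the GSM level simply means constructing one such location per sample, rather than a single study-wide (GSE) download.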

We note that, to support verification and transparency, the real-time log viewer within the CellAtria interface continuously displays each issued prompt along with a status indicator confirming whether the associated tool invocation was successful. This persistent execution trace facilitates user comprehension and troubleshooting during live interactions (Supplementary Figs. S3–S15).

Thus, CellAtria enables flexible, fine-grained data acquisition even under restricted access conditions, while preserving full transparency and traceability through real-time execution logs.

CellAtria retrieves pre-identified datasets from public databases and manages file integration

In applied research settings, analysts often begin with pre-identified datasets—discovered through public single-cell portals or cross-study meta-analyses—where the source publication is already known or metadata extraction has been performed independently. In such cases, CellAtria supports direct ingestion of datasets from public repositories using user-supplied download URLs, bypassing upstream literature parsing and metadata extraction steps. This scenario was demonstrated using two curated collections (H5AD-formatted) from the CZ Cell by Gene Discover platform14: one integrating scRNA-seq data across 13 single-cell studies from 8 tumor types and normal tissues to delineate myeloid-derived cell states44, and another compiling scRNA-seq data from 223 patients across 9 cancer types to investigate cancer cell-specific responses to immune checkpoint blockade45. Upon receiving the dataset locations, the agent initiates a sequence of interactions with the user, subsequently retrieves the files, integrates them into the working directory, and prepares the necessary configuration files for downstream analysis (Supplementary Figs. S16–S18).

This operational mode demonstrates the agent’s ability to coordinate data acquisition and file integration through natural language prompts, enabling flexible invocation of tools at non-linear entry points within the agent’s engineered execution narrative.

CellAtria enables shell-level execution and file navigation within agent-guided workflows

Scientific agentic systems must balance automation with transparency and user oversight, particularly over file system operations, environment context, and task provenance. While language models can coordinate tool execution, these lower-level operations are more effectively managed through dedicated interface components that complement the conversational layer. To address this, CellAtria integrates a set of interactive panels (n = 4) that expose critical system-level functionality during live agent sessions (Supplementary Fig. S19).

To demonstrate the backend interpretability and user oversight of CellAtria during natural language interaction, we performed a series of targeted tests. Submitting the very simple prompt “Article title?” (implicitly requesting extraction of the publication title from a given URL42), we observed the agent’s internal reasoning, tool invocation, and model output displayed step-by-step in the agent backend panel (Supplementary Fig. S19a). Despite the minimal input, the agent accurately inferred the intended task, invoked the appropriate tool, and returned only the relevant information—faithfully aligning its output with the user’s request. This live trace offered direct insight into how natural language queries are processed and aligned with the agent’s internal tool logic. To verify the agent’s workspace context and access to relevant data, we then utilized the embedded terminal panel to navigate the project directory. This interaction confirmed the agent’s correct positioning within the expected working path and its access to necessary input files and subdirectories (Supplementary Fig. S19b). Complementing the terminal view, the file browser panel allowed us to visually inspect the same directory structure interactively, further reinforcing the consistency between user-issued commands and the agent-managed file system (Supplementary Fig. S19c). Finally, to ensure end-to-end provenance, the export utility provides two downloadable artifacts from each session: (i) a machine-readable conversation transcript capturing prompts, tool calls, and model response traces (Supplementary Figs. S19d, S20), and (ii) a structured LLM metadata file that records the backend language model configuration (Supplementary Figs. S19d, S21).

In practice, when a failure occurs, the live log viewer surfaces the specific error, while the agent backend panel links it to the exact agentic step that failed. Users can then inspect the cause, verify file presence or structure through the interactive file browser, and, where necessary, apply corrective actions using the embedded terminal, such as renaming, decompressing, or relocating files. All of these actions are possible without leaving the UI environment. Collectively, these interactive components establish CellAtria as a transparent, traceable, and auditable agentic system. By embedding low-level controls within a high-level dialogue framework, the system balances automation with user-in-the-loop oversight—a critical feature for fostering trust in AI-driven scientific discovery pipelines.

CellAtria enables end-to-end automation of single-cell RNA-seq processing through CellExpress

Having demonstrated the capabilities and task-specific functionalities of CellAtria, we postulated that by interacting with a fully automated downstream pipeline, CellAtria could further transform literature-based inputs into fully processed single-cell datasets—minimizing the need for hands-on user time and specialized analytic expertise by linking study discovery with standardized downstream analysis through a unified, dialogue-driven workflow.

To this end, we developed CellExpress, a companion computational pipeline that standardizes the processing of scRNA-seq data, transforming raw inputs into biologically interpretable, analysis-ready outputs (Fig. 2b). CellExpress builds on previously published and validated methods, ensuring that their application is carried out in a consistent, efficient, and unified workflow with minimal user intervention. A complete methodological description is provided in the Methods section. To facilitate integration with CellAtria, we also developed a suite of agent-triggered tools that support comprehensive pipeline configuration, execution control, and real-time monitoring—enabling the agent to coordinate the entire analytical workflow through natural language interaction (Supplementary Fig. S2).

To demonstrate CellExpress orchestration through the CellAtria agent, we extended the previously initiated scenario using the peripheral blood scRNA-seq dataset from 2-month-old infants42. As with upstream tasks, the pipeline’s configuration, execution, and monitoring were conducted entirely through natural-language dialogue between the user and the agent, culminating in the successful operation of the CellExpress pipeline (Supplementary Figs. S22–S28). Additionally, to evaluate the full analytical scope of CellExpress, we performed an autonomous pipeline execution with all modules enabled through complete argument specification. In this single-run execution (runtime: ~30 min), 18 samples were processed, yielding ~71,000 cells after quality control filtering, which included thresholds on gene count, UMI count, mitochondrial gene content, and doublet detection. Batch effects were subsequently corrected to account for sample-specific variation. Dimensionality reduction was conducted using both UMAP and t-SNE embeddings, followed by graph-based clustering. Automated cell type annotation showed high concordance between tissue-agnostic and tissue-specific models, both aligning with the original study. To quantify this alignment, we compared CellExpress-derived cell type compositions against the original expert annotations using label-harmonized compartments. The results showed strong agreement, with Pearson correlations of ~0.99, reflecting high concordance in major shared immune lineage frequencies (98%), including T cells, B cells, NK cells, and myeloid populations (Supplementary Fig. S29). Finally, cluster-level marker gene identification was performed to support downstream biological interpretation. All intermediate outputs are tabulated, visualized, and consolidated into an HTML summary report (the corresponding file is provided in the CellAtria GitHub repository; see Data Availability).
In addition, all relevant artifacts, including the execution summary, sample metadata, and workflow configuration, are stored in a structured file to ensure auditability and reproducibility (Supplementary Fig. S30). A detailed description of the computational tools and models integrated into the CellExpress pipeline is provided in the Methods section.
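The compartment-level concordance check described above amounts to a Pearson correlation between two cell-type composition vectors over harmonized labels; the sketch below uses illustrative frequencies, not the study’s actual values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Harmonized compartments: T, B, NK, myeloid (illustrative fractions).
pipeline_freqs = [0.55, 0.15, 0.10, 0.20]  # hypothetical CellExpress output
expert_freqs = [0.53, 0.16, 0.11, 0.20]    # hypothetical expert annotation
r = pearson(pipeline_freqs, expert_freqs)
```

A correlation near 1 across harmonized compartments indicates that the automated annotation recovers the same lineage proportions as the expert labels.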

Thus, CellExpress addresses persistent challenges in single-cell transcriptomic analysis by delivering a fully integrated, end-to-end workflow that ensures transparency and reproducibility through the use of rigorously benchmarked, field-standard components.

CellAtria enables full-lifecycle document-to-analysis execution through fully autonomous toolchain orchestration

The hallmark of robust agentic systems lies in their ability to autonomously synthesize complex operational sequences from a single, high-level command—abstracting procedural complexity while preserving fidelity to the intended outcome. Such generalizable and autonomous orchestration reflects a carefully designed, context-aware system prompt and an explicit, unambiguous definition of tool input/output (I/O) behaviors—elements that collectively underpin the system’s capacity for full-scope task execution.

To evaluate the extent of autonomous task coordination in CellAtria, we tested its ability to execute a complete document-to-analysis workflow using a single instruction, thereby eliminating the need for iterative user-agent interaction. In this scenario, the agent was provided with the primary article URL reporting longitudinal scRNA-seq data from 2-month-old infants42, with the objective of executing the CellExpress pipeline using the associated single-cell data. In response, and without further user input, CellAtria autonomously carried out the full predesigned agentic workflow, leading to the successful execution of the CellExpress pipeline (Supplementary Fig. S31). In particular, CellAtria performed several key steps autonomously, including parsing the primary article, extracting structured metadata, retrieving the associated dataset, configuring the CellExpress pipeline with context-aware parameters, and dispatching the execution—all without manual intervention (Supplementary Fig. S32). Notably, this autonomous run, including the CellExpress pipeline’s runtime, completed all steps in under 10 min. This performance stands in stark contrast to the ~15 cumulative hours of manual effort typically required by a bioinformatics analyst for equivalent tasks, as per our internal benchmarks (including manual ingestion, metadata extraction, dataset retrieval, file reorganization, and fragmented script execution).

Hence, this assessment demonstrates CellAtria’s capability for strategic problem-solving by translating high-level objectives into a coherent sequence of tool invocations, marking its shift from prompt-driven response to an AI agent capable of autonomous workflow orchestration with minimal intervention and time-efficient performance.

CellAtria sustains scalable performance across multi-study compendia

To evaluate the robustness and scalability of CellAtria’s agentic automation, we benchmarked its end-to-end functionality across 25 publicly available human scRNA-seq datasets. These curated datasets span six major cancer types, including breast46,47,48,49,50 (n = 5), lung51,52,53 (n = 3), prostate54,55,56 (n = 3), colorectal57,58,59,60,61 (n = 5), ovary62,63,64,65,66 (n = 5), and pancreas67,68,69,70 (n = 4) (Supplementary Fig. S33). The cohort (290 samples) includes single-cell investigations of tumor-infiltrating immune remodeling (e.g., non-small cell lung cancer), resistance and phenotypic plasticity in metastatic settings (e.g., stage IV breast cancer), treatment-naive and relapsed disease states (e.g., prostate cancer), and immunologic shifts under checkpoint blockade (e.g., microsatellite instability-high colorectal cancer). It also features high-resolution atlases of epithelial-immune crosstalk (e.g., breast and colorectal cancer) and spatial or multi-omics extensions (e.g., ovarian and pancreatic cancers). This biologically diverse panel was deliberately selected to challenge the agent across a range of disease etiologies, tissue microenvironments, and metadata contextual complexity, thereby offering a realistic and representative cross-section of biomedical use cases.

CellAtria autonomously executed the complete GEO-to-analysis workflow for each study using gpt-4o as the agentic controller. This agentic process involved dynamic metadata extraction from the GEO landing page (e.g., organism, disease label, tissue type, and single-cell GSM identifiers), automated dataset retrieval, and dynamic configuration and orchestration of the CellExpress pipeline. All benchmarking runs completed successfully without manual intervention (Supplementary Fig. S33).

Agent interaction metrics, focusing on the LLM’s internal reasoning and decision-making steps, revealed consistently low execution times, with an average of 1.45 ± 1.25 min per task. Runtime variability was primarily driven by differences in dataset volume. To assess output consistency, two language model-specific indicators were used: the number of tokens generated per response (a proxy for content volume) and the response size in kilobytes (a proxy for serialized output size). These outputs averaged 6589 ± 1317 tokens and 20.5 ± 4.0 KB, respectively, with low Gini coefficients (≈0.10), indicating uniform verbosity and content balance across all runs (Supplementary Fig. S33).
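The Gini coefficient used here as a uniformity indicator can be computed directly from the per-run token counts; a minimal sketch using the standard mean-difference formula over sorted values:

```python
def gini(values):
    """Gini coefficient of a non-negative sample.

    0 indicates perfect uniformity (all runs equally verbose); values
    approaching 1 indicate that output is concentrated in a few runs.
    """
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))  # rank-weighted sum
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n
```

Applied to the 25 per-dataset token counts, a value near 0.10 confirms that no single study dominated the agent’s output volume.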

We next sought to assess the agent performance when using a different LLM on the exact same dataset cohort. We therefore repeated the analysis using gpt-4o-mini as the agentic controller; all benchmarking runs again completed successfully without manual intervention. The average agentic task duration was 1.70 ± 1.70 min per dataset. The model generated 5439 ± 1369 tokens per run, corresponding to an average serialized output size of 16.9 ± 4.1 KB. The Gini coefficients for both token counts and output size were 0.14, indicating low dispersion and thus consistent verbosity across runs (Supplementary Fig. S33).

Across repeated runs, we observed minor lexical variation in the way biological descriptors were rendered (e.g., “PBMC” vs. “peripheral blood mononuclear cells,” or abbreviated disease labels vs. expanded forms). However, these differences were confined to the natural-language layer. CellExpress consumes the structured, schema-validated arguments produced by the supporting tool layer and is therefore agnostic to such surface-level term variability.

When integrated with the CellExpress pipeline, CellAtria supported full end-to-end processing of approximately one million post quality control (QC) cells, drawn from diverse input formats including 10× Genomics HDF5 (.h5), directory trios (matrix, barcodes, features), and plain text matrices (txt.gz). Average runtime was 3.16 min per dataset under default analysis settings. Each dataset contributed roughly 3.9 × 10⁴ post-QC cells (≈39k ± 32k), with memory usage scaling accordingly (10.1 ± 9.8 GB per run). A moderately positive association was observed between post-QC cell count and memory consumption (Pearson r = 0.45, p < 0.05), indicating that RAM demand increases with dataset size, though not in a strictly linear fashion (Supplementary Fig. S33).

Hence, these results underscore CellAtria’s ability to handle complex and biologically rich single-cell datasets at scale. The agentic framework not only minimizes manual overhead but also maintains computational efficiency, output uniformity, and analytical robustness.

Discussion

Automation in single-cell analysis spans three trajectories: (i) targeted automation of labor-intensive, subjective steps (e.g., cell type annotation71,72), (ii) end-to-end pipelines assembled from automated modules73, and (iii) most recently, broad-scope, LLM-enabled code-generation workflows24. In this study, we unify these strands with an LLM-mediated, tool-centric paradigm that balances flexibility with scientific rigor and reproducibility. Specifically, we introduce CellAtria, an agentic system that enables dialogue-driven, document-to-analysis automation in single-cell research.

The system integrates a natural language interface with modular computational toolchains—including task-specific utilities that trigger the CellExpress pipeline (a co-developed, standardized, standalone single-cell analysis pipeline)—to form a unified semantic layer that orchestrates data triage and analysis. Its composable architecture supports non-linear, context-aware interaction, allowing users to engage flexibly at different stages of data ingestion and preparation. This design enables researchers to process more studies with less effort, effectively improving their analytical capacity without sacrificing reproducibility or protocol adherence.

CellAtria’s orchestration-first, tool-centric strategy coordinates a pre-vetted modular toolchain (including the CellExpress pipeline), leveraging LLM strengths in intent interpretation and task delegation while eschewing free-form code synthesis. This approach guarantees that all executed analytical steps (including complex single-cell data analysis) adhere to established best practices, are fully transparent, and maintain the high level of reliability and auditability essential for robust scientific discovery, all while leveraging the full orchestration potential of LLMs.

Agentic systems operate according to an underlying execution narrative—a structured sequence of modular actions that defines how tasks are interpreted and fulfilled. While this narrative is inherently flexible and permits user-initiated entry at arbitrary points in the workflow, it remains anchored in a coherent logic that guides the agent’s behavior and goals. This design departs from traditional rule-based automation by allowing unconstrained user interaction while still aligning those inputs with predefined tools and execution pathways. In the case of CellAtria, this narrative integrates a four-stage operational sequence: (1) dynamic metadata extraction, (2) dataset acquisition, (3) file organization, and (4) downstream analysis execution. Each module can be invoked independently; however, this study focuses on the optimized execution path, which reflects the intended canonical workflow.

Agentic AI frameworks establish a principled division of responsibilities: domain experts engage with high-level analytical tasks through natural language interfaces, while system developers ensure the integrity and robustness of the underlying toolchains. The core reasoning engine of these agents—LLMs—while effective at semantically interpreting dynamic user prompts and aligning them with task objectives, remains inherently prone to hallucination30,31, often producing linguistically coherent yet semantically or factually incorrect outputs. To curb these vulnerabilities, CellAtria embeds safeguards at three levels: (1) tool-schema validation (rejecting ill-formed or non-existent actions), (2) restricted invocation patterns (permitting only vetted tool sequences and parameters), and (3) boundary-aware system prompts (explicitly steering the agent to decline or defer when capabilities are exceeded) (Supplementary Fig. S34). Nevertheless, to safeguard analytical reliability and ensure interpretability, a human-in-the-loop paradigm, wherein a context-aware investigator actively evaluates, verifies, and contextualizes agent responses, remains indispensable.
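Safeguards (1) and (2) can be illustrated with a minimal schema-validation sketch; the tool names and argument schemas below are hypothetical, not CellAtria’s actual registry.

```python
# Per-tool argument schemas: which keys are mandatory and which are permitted.
TOOL_SCHEMAS = {
    "retrieve_dataset": {"required": {"accession"}, "allowed": {"accession", "dest"}},
    "run_pipeline": {"required": {"config_path"}, "allowed": {"config_path"}},
}

def validate_call(tool: str, args: dict) -> bool:
    """Accept a proposed tool call only if the tool exists and its
    arguments satisfy the schema; anything else is rejected before execution."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False  # non-existent (possibly hallucinated) action
    keys = set(args)
    # All required keys present, and no keys outside the allowed set.
    return schema["required"] <= keys <= schema["allowed"]
```

Because validation happens before dispatch, an ill-formed LLM proposal fails closed rather than reaching the file system or the pipeline.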

Agentic AI systems ultimately inherit the strengths and weaknesses of the LLMs that power their reasoning layer; different LLMs may diverge in how they interpret edge cases, resolve ambiguous instructions, or render domain-specific concepts, because generation remains probabilistic rather than fully deterministic. This places a sustained burden on system designers to harden the tool layer, but it also means that users should treat agent outputs as decision support, not as authoritative replacements for expert review. At the same time, this dependency is a feature: as LLMs improve in grounding, factuality, and tool-use reliability, agentic workflows will gain robustness without requiring architectural rewrites.

CellAtria’s metadata extraction capabilities are optimized for scientific articles that conform to structured narrative conventions, such as standardized sectioning and consistent biomedical terminology. While the underlying language model has broad generalization capacity, it may face ambiguity when applied to unstructured or idiosyncratic content, such as informal notes lacking hierarchical organization, inconsistent labeling, or documents with irregular phrasing (e.g., nonstandard abbreviations, ad hoc formatting, or domain-specific shorthand). For instance, when the agent encounters manuscripts that lack structured section headers or essential metadata fields (e.g., species, tissue type, disease), CellAtria defaults to marking those fields as unavailable. This safeguard is intentionally designed to prevent speculative inference and reduce hallucination risk.

A key design consideration for agentic systems is the LLM’s lack of direct internet access—a common constraint adopted to ensure security and maintain control over external interactions. Consequently, the system depends on deterministic, tool-mediated mechanisms for content retrieval and cannot independently browse or query online databases in real time. This reliance highlights the critical importance of standardized data management practices74: discrepancies between manuscript-reported metadata and repository-level annotations can hinder the reliability of agentic execution. We therefore advocate for closer harmonization between narrative metadata and structured repository schemas to better support scalable agentic applications.

The CellAtria framework’s objective was not to replicate outcomes from original publications, which would require access to the exact computing infrastructure, parameter choices, and context-specific decisions of the original analysis, many of which are analyst-dependent and rarely documented comprehensively. The variability introduced by manual, human-guided analysis reinforces the need for standardized frameworks like CellExpress. By enforcing schema-constrained execution and orchestrated automation, these tools mitigate analytic drift and interpretive subjectivity.

The modular design of CellAtria, combined with its graph-based execution architecture, enables extensibility: new decision-making and task-routing capabilities can be incorporated at the orchestration layer, while analytical and modality-specific logic is delegated to pluggable downstream pipelines. Therefore, the framework is pipeline-agnostic by construction, allowing alternative workflows to be integrated without redesigning the agentic core.

As automated frameworks increasingly interface with regulated data sources, ensuring ethical and compliant data handling becomes essential. When extracting metadata from scientific literature, CellAtria’s schema captures repository identifiers, publisher information, and author-reported conflicts of interest, providing legal and provenance context for each dataset. For proprietary or sensitive data, the Docker-based architecture of CellAtria enables reproducible execution within secure, auditable, and GxP-compliant environments, a critical requirement in clinical research.

In conclusion, CellAtria demonstrates how domain-informed agentic AI systems can operationalize scientific research processes in a computationally skill-agnostic manner, thereby accelerating discovery and supporting the transition toward next-generation, AI-integrated research ecosystems.

Methods

CellAtria interface and architectural framework

The CellAtria interface integrates seven principal components that provide fine-grained control over task execution and collectively facilitate rich agentic interaction (Supplementary Fig. S1): (1) Persistent Chatbot Window: Manages user–agent communication and maintains conversational continuity through an internal hidden state, supporting coherent, multi-turn exchanges. (2) User Input Panel: Accepts textual prompts and facilitates document uploads, with all inputs jointly processed through a unified execution handler. (3) Real-time Log Viewer: Displays user-agent transaction status. (4) Agent Backend Panel: Provides a live, step-by-step view of the agent’s internal reasoning, tool invocation sequence, and backend responses, directly supporting transparency and debugging. (5) Embedded Terminal Panel: Enables direct shell interaction within the agent’s runtime environment, facilitating robust system-level control without exiting the interface. (6) Interactive File Browser: Allows users to navigate project directories and inspect file contents within the active workspace, complementing terminal operations. (7) Export Utility: Captures session-level provenance by generating two structured artifacts: a machine-readable transcript of all user–agent interactions, and a metadata specification detailing the backend language model configuration.

These diverse interface elements are orchestrated atop a LangGraph-based75 backbone that encodes the agent’s execution flow logic as a directed graph of modular, state-aware functions, ensuring robust and interpretable coordination across heterogeneous toolchains and complex computational workflows.
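As a minimal illustration of this execution model, the sketch below implements a directed graph of modular, state-aware node functions in plain Python. The node names, state fields, and hand-rolled dispatch loop are hypothetical stand-ins; the actual CellAtria backbone is built on LangGraph rather than this simplified scheme.

```python
# Illustrative sketch of a directed execution graph of state-aware
# node functions. Node names and state fields are hypothetical;
# this is not the actual CellAtria/LangGraph implementation.

def extract_metadata(state):
    state["metadata"] = {"species": "human"}   # placeholder node result
    return state, "acquire_data"               # outgoing edge to next node

def acquire_data(state):
    state["files"] = ["sample1.h5"]            # placeholder node result
    return state, "end"                        # terminal edge

# The graph: each node is a function mapping state -> (state, next node).
GRAPH = {"extract_metadata": extract_metadata, "acquire_data": acquire_data}

def run(entry, state):
    """Traverse the graph from an arbitrary entry point until 'end'."""
    node = entry
    while node != "end":
        state, node = GRAPH[node](state)
    return state

final = run("extract_metadata", {})
print(final)  # accumulated state after the full traversal
```

Because every node shares the same state-in, state-out contract, execution can also begin mid-graph (e.g., `run("acquire_data", prior_state)`), mirroring the user-initiated arbitrary entry points described above.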

CellAtria modular toolchain for task execution

To enable flexible and robust task execution, we developed a comprehensive suite of interoperable tools that encapsulate core agent functionalities across four principal operational domains (Supplementary Fig. S2): (1) Metadata Parsing and Semantic Structuring: Handles the extraction and organized representation of relevant information from diverse sources. (2) Programmatic Data Retrieval and Hierarchical Organization: Manages the automated acquisition of datasets and their structured arrangement within the workspace. (3) Standardized File Handling and Pre-processing: Ensures consistent management and preparation of data files for downstream analysis. (4) Automated Workflow Configuration and Execution Orchestration: Facilitates the dynamic setup and control of complex computational analysis pipelines. Each tool is implemented as an atomic function with rigorously defined input/output (I/O) behavior, a design choice that enables the agent to compose dynamic and reliable task sequences in response to user prompts. These tools are inherently embedded as graph nodes within CellAtria’s architectural backbone and accessed through natural-language interfaces, thereby facilitating intelligent, context-aware orchestration. This modular architecture not only accommodates user-initiated entry at arbitrary points within the workflow but also establishes a crucial foundation for tool reuse, adaptation, and the scalable extension of execution flows.

The LLM is responsible exclusively for unstructured content interpretation and structured metadata inference, as well as orchestrating execution of the pre‑vetted tools and analysis pipeline. All analytical decisions are governed by fixed logic or schema‑validated defaults, with no parameter choices delegated to the LLM. CellAtria does not currently employ Retrieval-Augmented Generation (RAG) or Model Context Protocol (MCP), as its modular toolchain and internal state tracking fulfill analogous roles in orchestrating dynamic task flows.
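The schema-constrained invocation pattern described here, in which non-existent tools and ill-formed arguments are rejected before any execution occurs, can be sketched as follows. The tool registry, argument names, and return value are hypothetical simplifications; the real CellAtria toolchain uses richer schemas.

```python
# Illustrative sketch of tool-schema validation: the agent may only
# dispatch vetted tools with well-formed arguments. The registry and
# argument names below are hypothetical, not CellAtria's actual schema.

TOOL_SCHEMAS = {
    "download_dataset": {"accession": str, "dest_dir": str},
}

def invoke(tool_name, **kwargs):
    """Reject unknown tools and ill-formed arguments before execution."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"unknown tool: {tool_name}")
    for arg, typ in schema.items():
        if arg not in kwargs:
            raise ValueError(f"missing argument: {arg}")
        if not isinstance(kwargs[arg], typ):
            raise TypeError(f"argument {arg} must be {typ.__name__}")
    extra = set(kwargs) - set(schema)
    if extra:
        raise ValueError(f"unexpected arguments: {sorted(extra)}")
    # Only schema-conformant calls reach the (vetted) tool body.
    return {"tool": tool_name, "args": kwargs, "status": "dispatched"}

print(invoke("download_dataset", accession="GSE213996", dest_dir="/data"))
```

Under this contract, the LLM proposes tool calls but never bypasses validation: a hallucinated tool name or parameter fails loudly at the schema layer instead of silently corrupting the workflow.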

Web and PDF article ingestion for metadata extraction

As part of the metadata parsing and semantic structuring module, CellAtria implements lightweight content extraction routines for both web-based and local sources to facilitate automated ingestion of scientific literature. For journal URLs, the system programmatically retrieves HTML content using the “requests” library (v2.32.5) and isolates visible textual content via “BeautifulSoup” (v4.13.3), applying DOM-level (document object model) filtering to exclude non-informative elements such as scripts, metadata, and stylesheets. For local PDF documents, CellAtria leverages the “PyMuPDF” library (imported as “fitz”; v1.26.5) to extract paragraph-level text from the document’s text layer by iterating across pages. In both cases, the resulting structured text is passed to the language model for semantic parsing and downstream metadata field extraction. This approach is explicitly tool-mediated and does not rely on vector retrieval or RAG-style pipelines.
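The DOM-level filtering step can be approximated with the standard library alone, as sketched below. Note that the actual implementation uses “requests” and “BeautifulSoup” as stated above; this stdlib version only illustrates the idea of suppressing script, style, and header elements while collecting visible text.

```python
# Stdlib approximation of DOM-level filtering of non-informative
# elements. Simplified for illustration; CellAtria itself uses
# requests + BeautifulSoup as described in the text.
from html.parser import HTMLParser

SKIP = {"script", "style", "head", "noscript"}  # paired tags to suppress

class VisibleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside suppressed elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside every suppressed element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)

doc = ("<html><head><title>t</title><script>x=1</script></head>"
       "<body><p>Results: 5,000 cells.</p></body></html>")
print(visible_text(doc))  # prints "Results: 5,000 cells."
```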

Standardized single-cell data analysis via CellExpress

A cornerstone of CellAtria’s full-scale capabilities is the co-developed CellExpress, a standardized, fully automated single-cell analysis pipeline engineered to deliver robust scRNA-seq analysis, from raw count matrices through comprehensive processing and report generation (Fig. 2b). Designed to lower bioinformatics barriers, CellExpress implements a comprehensive set of state-of-the-art, Scanpy-based76 processing stages, including: (1) quality control, performed globally or per sample; (2) data transformation, encompassing normalization, highly variable gene selection, and scaling; (3) dimensionality reduction, utilizing UMAP77 and t-SNE78; (4) graph-based clustering79; and (5) marker gene identification. Additional tools are seamlessly integrated to support advanced analysis tasks, such as doublet detection (Scrublet80), batch correction (Harmony81 and scVI82), and automated cell type annotation using both tissue-agnostic (SCimilarity83) and tissue-specific (CellTypist84) models.

All analytical steps are executed sequentially under centralized control, with parameters fully configurable via a comprehensive input schema. All arguments are made accessible through CellAtria’s agentic interface, allowing users to interact with and query their configurations using natural language (Supplementary Fig. S35). In addition, user-defined metadata (e.g., sex, disease status, tissue type, and other custom annotations) can be supplied alongside the dataset in a structured metadata table. These annotations are automatically incorporated into the analysis pipeline, enabling metadata-aware processing.
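A structured metadata table of this kind might look as follows; the column names and values here are illustrative placeholders, and the exact expected fields are defined by the CellExpress input schema.

```csv
sample_id,sex,disease_status,tissue
S1,female,healthy,peripheral blood
S2,male,disease,peripheral blood
```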

CellExpress implements a curated set of preprocessing steps and QC parameters informed by consensus practices in the field76,85,86,87,88,89,90. These defaults serve as a principled starting point rather than a fixed prescription and can be overridden when dataset-specific adjustment is warranted. Due diligence should be exercised by taking into account sample-specific factors, such as tissue type, dissociation protocol, platform-specific artifacts, and study objectives, when modifying pipeline parameters.
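For illustration only, a dataset-specific override of such defaults might resemble the following fragment. The key names are hypothetical stand-ins rather than CellExpress’s actual schema; the values mirror the thresholds used in the full-workflow demonstration described later in the Methods.

```json
{
  "min_umis_per_cell": 750,
  "min_genes_per_cell": 250,
  "min_cells_per_gene": 3,
  "max_pct_mito": 15,
  "doublet_score_cutoff": 0.25
}
```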

The pipeline natively supports a broad range of standard single‑cell inputs, including 10X Genomics Cell Ranger91 outputs (matrix, barcodes, and features), HDF5 (h5) matrices, and AnnData (H5AD) objects. In addition to 10X‑based chemistries, CellExpress also supports Parse Biosciences–formatted inputs92 (gene, metadata, and count matrix), enabling compatibility with non‑10X droplet‑based platforms. The pipeline also accepts generic count matrices supplied in plain‑text (txt.gz) or CSV‑style (csv.gz) format.

Upon execution, CellExpress processes a designated set of scRNA-seq samples and generates a comprehensive analytical package comprising four components: (1) a finalized, fully annotated AnnData object, directly suitable for downstream analysis; (2) a structured, publication-ready HTML report that captures a complete snapshot of the entire workflow, including key parameters, quality control metrics, dimensionality reduction, clustering results, and cell type annotations, presented through dynamic tables and graphical visualizations; (3) a complete export of all configuration settings in a machine-readable, standardized format to ensure auditability and full reproducibility; and (4) a quality control-filtered AnnData object generated for reuse in alternative or customized workflows.

CellExpress executes as a single-pass pipeline yet supports iterative workflows: intermediate outputs, such as QC visualizations, embeddings, clustering results, and annotations, are exposed in the generated HTML report, and parameters can be readily adjusted for seamless re-execution.

Designed for flexible deployment, CellExpress operates as a fully standalone pipeline for comprehensive scRNA-seq data analysis. It can be orchestrated either through an agentic system—as incorporated into the CellAtria framework—or via direct command-line execution. Furthermore, its Pythonic foundation directly addresses scalability constraints commonly associated with R-based pipelines (CellBridge73), enabling more efficient handling of large single-cell datasets. We note that on-disk execution strategies, such as BPCells93, have improved the scalability of R-based workflows.

While CellExpress standardizes core processing, it intentionally delegates more specialized downstream analyses to the post-pipeline stage. This reflects the fact that such tasks often rely on study-specific priors, including subpopulation annotations, lineage assumptions, or temporal structure, which are not universally applicable. By restricting built-in methods to those broadly generalizable across datasets, CellExpress preserves flexibility for downstream analytical design to be tailored according to specific research objectives.

Preprocessing configuration for full CellExpress workflow demonstration

In the experiment benchmarking full pipeline execution of CellExpress using a peripheral blood scRNA‑seq dataset from 2‑month‑old infants (GSE213996), stringent quality control thresholds were applied: a minimum of 750 UMIs per cell, at least 250 detected genes per cell, and exclusion of genes expressed in fewer than 3 cells. Cells with mitochondrial gene content exceeding 15% were removed. Doublets were identified and filtered using Scrublet with a score cutoff of 0.25. Batch correction was performed using Harmony81, and cell type annotations were obtained using both SCimilarity83 and CellTypist84.

CellAtria containerized environment and CellExpress execution

The CellAtria runtime environment is fully containerized using Docker, enabling consistent and reproducible deployment across diverse computational infrastructures. This encapsulation strategy ensures environmental parity by isolating workflows from system-specific variability and mitigating software dependency conflicts, thereby facilitating seamless portability across local, cloud, and high-performance computing environments.

To preserve agent responsiveness during potentially long-running computations, CellAtria executes the CellExpress pipeline in a detached mode. Upon receiving a complete execution schema, the agent delegates the task to a background subprocess, decoupling it from the interactive session. Standard output and error streams are redirected to persistent log files for downstream inspection, and a unique process identifier is recorded to support real-time status tracking and diagnostics.
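The detached execution pattern described above can be sketched with the standard library as follows. The function name, log-file layout, and stand-in task are hypothetical; this is a simplified illustration, not the actual CellAtria code.

```python
# Illustrative sketch of detached pipeline execution: the launcher
# returns immediately, streams go to persistent log files, and the
# process identifier is recorded for status tracking. Names and
# paths are hypothetical; not the actual CellAtria implementation.
import pathlib
import subprocess
import sys

def launch_detached(cmd, log_dir):
    log_dir = pathlib.Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out = open(log_dir / "stdout.log", "w")
    err = open(log_dir / "stderr.log", "w")
    # Popen returns without waiting, so the interactive session
    # stays responsive while the pipeline runs in the background.
    proc = subprocess.Popen(cmd, stdout=out, stderr=err)
    # Record the PID for later status checks and diagnostics.
    (log_dir / "pid.txt").write_text(str(proc.pid))
    return proc

# Example: a trivial stand-in task in place of the real pipeline.
p = launch_detached([sys.executable, "-c", "print('pipeline done')"], "logs")
p.wait()  # only for this demonstration; the agent would not block
print(open("logs/stdout.log").read().strip())  # prints "pipeline done"
```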

Large language model provider and computational environment

For all experiments, the LLM backend was provisioned via Azure OpenAI, specifically using managed deployments of gpt-4o (version as of 2024-11-20) and gpt-4o-mini (version as of 2024-07-18). No fine-tuning or domain-specific retraining was performed. The models were operated with a sampling temperature of 1.0 and a nucleus sampling parameter (top‑p) of 1.0. The CellAtria agent was built using the LangGraph (v0.5.4) orchestration framework. All operations were executed within a Docker container pinned to Python 3.12.9, running on an AWS EC2 r6i.32xlarge instance with 128 vCPUs and 1,024 GiB RAM. HTML reports produced by the CellExpress pipeline include an embedded version provenance section that programmatically records the exact versions of all R and Python packages used in the workflow.