Introduction

Recent advances in computational oncology have been fueled by the increasing digitization of diverse biomedical data, including structured clinical variables (such as demographics, tumor staging, and laboratory results), unstructured clinical narratives (such as pathology reports, radiology reports, and physician notes), medical imaging (radiology scans and whole-slide images or WSI), and high-dimensional molecular profiles1,2,3,4,5,6. This wealth of multimodal data offers unprecedented opportunities to improve patient stratification, predict treatment response, and model disease progression2,3,6. In parallel, the adaptation of deep learning techniques from computer vision and natural language processing has enabled powerful solutions in these domains3,7. However, a fundamental challenge remains: the absence of robust, generalizable methods for integrating these heterogeneous data sources into unified representations that capture the biological complexity of cancer and support predictive modeling8.

Although large-scale biomedical data is increasingly available and actively analyzed in oncology, it remains fragmented across distinct modalities, such as clinical data (structured variables and unstructured narratives), radiological and pathological imaging, and molecular profiles, which are typically processed separately. This siloed approach limits the ability to integrate complementary information across modalities for unified, patient-centered analysis8,9. Availability of large-scale datasets and advances in self-supervised learning have enabled the development of foundation models (FMs)1,10,11. These models, pretrained on text, imaging, or molecular data, have advanced feature extraction within individual modalities by learning latent representations that capture domain-specific patterns. These modality-specific embeddings can be adapted for downstream oncology tasks such as cancer classification or overall survival (OS) prediction. However, in practice, these models are typically applied within single- or dual-modality workflows, leaving the complementary information across modalities underutilized12. While multimodal data availability continues to expand in oncology, a critical bottleneck remains: the absence of standardized, scalable frameworks that integrate modality-specific embeddings into unified, patient-level representations that capture multimodal patient similarity and support downstream oncology tasks.

We hypothesize that integrating FM-derived embeddings from multiple data modalities can yield richer and more clinically informative patient representations, particularly in settings where clinical data are incomplete or less structured. Rather than relying solely on model scaling or increasing parameter counts, we propose that fusing complementary information from diverse biomedical data types offers a powerful, orthogonal approach to enhance predictive performance in oncology. To test this hypothesis, we present HONeYBEE, the Harmonized ONcologY Biomedical Embedding Encoder (https://lab-rasool.github.io/HoneyBee/). HONeYBEE is an open-source framework that generates individual patient-level embeddings from (i) structured and unstructured clinical data, (ii) pathology reports, (iii) radiologic images, (iv) WSIs, and (v) molecular profiles using modality-specific FMs. HONeYBEE integrates these embeddings via concatenation, mean pooling, and Kronecker product fusion strategies to create unified, multimodal representations optimized for downstream oncology tasks, including cancer subtype classification, patient clustering, OS prediction, and patient similarity retrieval.

While numerous models and pipelines exist for analyzing clinical, imaging, and molecular data, most current tools remain modality-specific and lack the flexibility to support unified, end-to-end multimodal workflows1,13. Existing methods are typically implemented as isolated codebases with rigid dependencies, domain-specific interfaces, and limited extensibility, which complicates reproducibility and impedes multimodal experimentation14. Moreover, the absence of standardized pipelines for modality-specific embedding generation, harmonization, and flexible fusion introduces substantial technical barriers, slowing the development of clinically meaningful AI models15. Addressing these limitations requires not only access to multimodal data but also modular infrastructure capable of generating, integrating, and utilizing diverse patient-level embeddings in scalable, reproducible ways.

HONeYBEE directly addresses this gap by providing a modular, open-source framework for multimodal embedding generation and integration. Built around domain-specific FMs, HONeYBEE supports the standardized preprocessing and representation of five key oncology data modalities8,16,17,18,19,20. Each modality is processed through dedicated pipelines, producing modality-specific embeddings optimized for downstream use. These embeddings can then be integrated using flexible fusion strategies. Through a simplified API and modular design, HONeYBEE enables oncology researchers to easily incorporate new FMs, compare embedding strategies, and deploy multimodal embeddings for new tasks. By abstracting away complexities related to data harmonization, preprocessing, and embedding fusion, HONeYBEE facilitates scalable, reproducible, and clinically meaningful multimodal analysis. An overview of the framework’s architecture and supported modalities is shown in Fig. 1.

Fig. 1: Overview of the HONeYBEE framework.

HONeYBEE provides a standardized, scalable framework for generating and integrating multimodal embeddings from oncology datasets. Multimodal datasets from public repositories, such as the NCI CRDC92, including the PDC93, GDC94, IDC95, and TCIA96, as well as private institutional datasets, can be processed through dedicated modality-specific pipelines. A Histopathology images are processed via WSI loading, stain normalization, tissue detection, and patch extraction prior to the extraction of embeddings. B Molecular data undergo cleaning and normalization across multiple omics layers, including DNA methylation, gene expression, and somatic mutations. C Clinical text, including pathology and radiology reports and structured data from the electronic health record (EHR), is extracted and tokenized into coherent text segments for downstream processing. D Radiology images are processed using volumetric slicing, patch extraction, and modality-specific preprocessing optimized for downstream tasks. E Embeddings, fixed-length feature vectors that capture the most informative patterns from each data modality, are generated using FMs. HONeYBEE supports both pretrained, publicly available, and custom FMs via the Hugging Face ecosystem. All extracted embeddings and processed datasets are publicly released to promote reproducibility, sharing, and external validation. F The multimodal embeddings serve as a standardized input for downstream oncology tasks, such as those evaluated in this work (cancer classification, OS prediction, patient retrieval) and potential future directions (such as treatment recommendation and biomarker discovery).

To further support adoption and interoperability, HONeYBEE is designed for seamless integration with widely used biomedical data repositories and machine learning platforms. It supports direct data ingestion from resources such as the NCI Cancer Research Data Commons (CRDC), Proteomics Data Commons (PDC), Genomic Data Commons (GDC), Imaging Data Commons (IDC), and The Cancer Imaging Archive (TCIA). The framework is fully compatible with PyTorch, Hugging Face, and FAISS, and includes pretrained FMs along with pipelines for adding new models and modalities. Unlike existing single-modality tools or highly customized pipelines, HONeYBEE offers flexible deployment, standardized embedding workflows, and minimal-code implementation of state-of-the-art techniques21,22,23. In the sections that follow, we evaluate HONeYBEE’s performance across key oncology tasks, including clustering, cancer subtype classification, OS prediction, and patient similarity retrieval, using the TCGA dataset to demonstrate its effectiveness as a generalizable platform for multimodal cancer research.

Results

We evaluated HONeYBEE using multimodal patient-level data from The Cancer Genome Atlas (TCGA), encompassing 11,428 patients across 33 cancer types. Available data modalities included clinical text (11,428 patients), molecular profiles (13,804 samples from 10,938 patients), pathology reports (11,108 patients), WSIs (8060 patients), and radiologic images (1149 patients). This heterogeneous, incomplete modality availability reflects real-world clinical data constraints and allowed evaluation of HONeYBEE’s robustness in handling missing data.

HONeYBEE-generated embeddings were assessed across four core downstream tasks: cancer type classification, patient similarity retrieval, cancer-type clustering, and OS prediction. Analyses evaluated both modality-specific embeddings and integrated multimodal embeddings generated using fusion strategies such as concatenation, mean pooling, and Kronecker product. The modular design of HONeYBEE accommodated patients with missing modalities without requiring complete-case cohorts.

Modality-specific embedding generation

HONeYBEE supports generating modality-specific embeddings from five primary data types using state-of-the-art FMs. This section describes the default models integrated into HONeYBEE for each data modality, explaining the rationale behind their selection and their specific capabilities. While we present and evaluate specific models in our analysis, any FMs from Hugging Face can be added to HONeYBEE.

For clinical text and pathology reports, HONeYBEE supports multiple language models: GatorTron17, Qwen324, Med-Gemma25, and Llama-3.226. For the primary analyses presented in this paper (cancer type classification, survival prediction, and patient retrieval), we used GatorTron embeddings due to their specialized training on clinical text. The comparative evaluation of all four language models (GatorTron, Qwen3, Med-Gemma, and Llama-3.2) is presented separately in the section “Comparative evaluation of language models for text embeddings” to demonstrate the framework’s flexibility in supporting diverse architectural paradigms.
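
Because HONeYBEE builds on the Hugging Face ecosystem, any encoder that exposes token-level hidden states can serve as a text FM. The sketch below shows the general pattern for producing a text embedding with such a model; the checkpoint identifier, pooling strategy, and truncation settings are illustrative assumptions rather than HONeYBEE’s exact defaults.

```python
# Sketch: extracting a clinical-text embedding with a Hugging Face encoder.
# The model ID and mean pooling are illustrative assumptions, not the exact
# HONeYBEE defaults.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "UFNLP/gatortron-base"  # assumed GatorTron checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

report = "Invasive ductal carcinoma, grade 2, ER positive, HER2 negative."
inputs = tokenizer(report, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    embedding = (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled text embedding

print(embedding.shape)  # (1, hidden_dim); hidden_dim depends on the checkpoint
```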

For WSI embedding generation, HONeYBEE supports UNI (ViT-L/16, 307M parameters), UNI2-h (ViT-g/14, 632M parameters), and Virchow2 (DINOv2, 1.1B parameters)16,27. UNI provides 1024-dimensional embeddings optimized for efficiency, making it suitable for large-scale processing with limited computational resources. UNI2-h, with its larger architecture, offers enhanced feature extraction capabilities, while Virchow2, built on the DINOv2 framework, provides the most comprehensive representations through self-supervised learning on diverse pathology datasets. For the primary analyses in this paper, we used UNI embeddings, while the comparative evaluation of staining impact across all three models is presented in Supplementary Materials (see Supplementary Note—Staining impact of performance, Supplementary Figs. 10 and 11, and Supplementary Tables 7 and 8).

For the radiological imaging data, HONeYBEE supports RadImageNet, a convolutional neural network (CNN) pre-trained on over four million medical images across CT, MRI, and PET modalities20. The model’s multi-modality pretraining enables it to handle the diverse imaging protocols commonly encountered in oncology, from contrast-enhanced CT scans to functional PET imaging. RadImageNet embeddings were used for all radiology analyses presented in this paper.

For the molecular data, HONeYBEE includes SeNMo19. SeNMo is a self-normalizing deep learning encoder specifically designed for high-dimensional multi-omics data, including gene expression, DNA methylation, somatic mutations, miRNA, and protein expression19. The self-normalizing properties of SeNMo ensure stable training even with the diverse scales and distributions present in multi-omics data. SeNMo embeddings were used for all molecular analyses presented in this paper.

Modality-specific embeddings are generated using standardized pipelines provided in the HONeYBEE framework, which is fully compatible with the Hugging Face ecosystem. This design choice enables easy integration of new FMs as they become available. HONeYBEE’s API abstracts the complexity of model-specific preprocessing requirements while maintaining flexibility. The resulting patient-level feature vectors, along with associated metadata, have been publicly released via Hugging Face repositories including TCGA (https://huggingface.co/datasets/Lab-Rasool/TCGA), CGCI (https://huggingface.co/datasets/Lab-Rasool/CGCI), Foundation Medicine (https://huggingface.co/datasets/Lab-Rasool/FM), CPTAC (https://huggingface.co/datasets/Lab-Rasool/CPTAC), and TARGET (https://huggingface.co/datasets/Lab-Rasool/TARGET).
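
The released embeddings can be pulled directly with the Hugging Face datasets library. In the sketch below, the configuration and split names are placeholders; the actual schema is documented on each dataset card.

```python
# Sketch: loading the publicly released TCGA embeddings from Hugging Face.
# The configuration name ("clinical") and split are assumptions; check the
# dataset card at https://huggingface.co/datasets/Lab-Rasool/TCGA for the
# actual layout.
from datasets import load_dataset

ds = load_dataset("Lab-Rasool/TCGA", "clinical", split="train")  # hypothetical config
print(ds.column_names)   # e.g., patient identifiers, embedding vectors, metadata
first_row = ds[0]
```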

Multimodal embedding integration and analysis

For multimodal integration, we implemented three fusion strategies within HONeYBEE to accommodate heterogeneous data availability across 11,424 patients (99.97% of the cohort) who had at least two available modalities: (1) concatenation, which joins available embeddings while preserving modality-specific information; (2) mean pooling, which averages embeddings after padding to a common dimension; and (3) Kronecker product, which captures pairwise interactions between modalities. To evaluate the resulting multimodal embedding spaces, we quantified cancer-type separability using normalized mutual information (NMI) and adjusted mutual information (AMI), which respectively measure the alignment of embedding clusters with known cancer-type labels and clustering quality adjusted for chance.
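
For illustration, the following minimal sketch applies the three fusion strategies to a single patient with two available modalities; the variable names, embedding dimensions, and zero-padding scheme for mean pooling are assumptions.

```python
# Sketch of the three fusion strategies for one patient with two modalities.
# Dimensions are illustrative; the padding scheme for mean pooling is assumed
# to be zero-padding, which the paper does not specify.
import numpy as np

clinical = np.random.randn(1024)   # e.g., clinical-text embedding
molecular = np.random.randn(48)    # e.g., molecular embedding
modalities = [clinical, molecular]

# (1) Concatenation: preserves modality-specific information.
concat = np.concatenate(modalities)                      # shape (1072,)

# (2) Mean pooling: pad to a common dimension, then average.
d = max(len(m) for m in modalities)
padded = [np.pad(m, (0, d - len(m))) for m in modalities]
mean_pooled = np.mean(padded, axis=0)                    # shape (1024,)

# (3) Kronecker product: pairwise interactions between two modalities.
kron = np.kron(clinical, molecular)                      # shape (1024 * 48,)
```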

Table 1 presents both clustering metrics (NMI and AMI) for single and multimodal embeddings and for all three integration strategies. Surprisingly, clinical embeddings alone yielded the strongest cancer-type clustering, with an NMI of 0.7448 and AMI of 0.702, outperforming both the other single modalities and all multimodal fusion strategies. This result likely reflects the curated nature of clinical documentation in TCGA, where key diagnostic variables, such as cancer subtype, stage, grade, and molecular markers, are extracted and recorded by clinical experts, effectively summarizing information that may be dispersed across raw radiology, pathology, and molecular data. However, all three multimodal fusion approaches (concatenation, mean pooling, and Kronecker product) outperformed the weaker single modalities, such as molecular, radiology, and WSI embeddings. Among the fusion methods, concatenation achieved the best clustering performance, with an NMI of 0.4440 and AMI of 0.347.
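
A minimal sketch of how such separability scores can be computed is shown below; the choice of clustering algorithm (k-means with one cluster per cancer type) is an assumption, since the comparison only requires cluster assignments and ground-truth labels.

```python
# Sketch: cancer-type separability of an embedding matrix X (n_patients x dim)
# against cancer-type labels y. K-means is an assumed clusterer; the paper
# reports NMI/AMI but does not fix the clustering algorithm here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score

def clustering_scores(X: np.ndarray, y: np.ndarray, n_clusters: int = 33, seed: int = 0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    nmi = normalized_mutual_info_score(y, labels)
    ami = adjusted_mutual_info_score(y, labels)
    return nmi, ami
```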

Table 1 Performance of multimodal fusion strategies vs. single-modality embeddings for cancer-type clustering

Although multimodal fusion did not surpass clinical embeddings in clustering performance, it provided a robust approach to integrating heterogeneous data types and accommodating missing modalities. Visualization of the concatenated multimodal embedding space demonstrated clearer cancer-type separation compared to weaker single modalities, such as radiology and WSI (Supplementary Figs. 6–8). While clinical text remained the dominant signal for cancer-type differentiation, multimodal fusion effectively combined information from lower-performing modalities, supporting more comprehensive patient-level representations in cases with limited clinical data.

We further evaluated the clustering behavior of each individual modality to establish a baseline for cancer-type separability. Clinical embeddings demonstrated the strongest performance, with clearly defined boundaries between the 33 cancer types, achieving 90.21% classification accuracy with a simple random forest classifier. Molecular embeddings (13,804 samples) showed structured but overlapping clusters, reflecting biological similarities among related cancers. Pathology report embeddings (11,108 samples) exhibited moderate cancer-type clustering, while radiology embeddings (1149 samples) displayed diffuse clusters, consistent with limited data availability and lower cancer-type specificity. Whole-slide image embeddings (8060 samples) showed minimal clustering, likely due to slide-level variability and indirect representation of cancer-type features. Detailed t-SNE visualizations for all modalities are provided in the Supplementary Materials (Supplementary Figs. 1–6 and Supplementary Note—Embeddings visualization).

Evaluation of multimodal embeddings on downstream tasks

We evaluated HONeYBEE-generated embeddings on three key downstream tasks: OS prediction, cancer type classification, and patient similarity retrieval. These tasks were selected to assess the generalizability and practical utility of both modality-specific and multimodal embeddings. Given the heterogeneous availability of clinical, radiologic, WSI, and molecular data across patients, we evaluated both standalone and integrated embeddings to assess robustness under real-world data constraints.

We performed survival analysis stratified across all 33 TCGA cancer types, training models on each cancer type individually. For each cancer type and modality combination, we performed 5-fold cross-validation with stratification based on survival outcomes and censoring status to ensure representative distributions across folds. This stratification approach maintains balanced representation of both event rates and follow-up durations in each fold, providing reliable concordance index estimates. For cancer types with limited samples, we merged similar cancers (e.g., TCGA-COAD and TCGA-READ for colorectal cancers) to ensure model training with adequate sample sizes. For each cancer type, we trained three widely used survival models: Cox proportional hazards (CoxPH), Random Survival Forests (RSF), and DeepSurv28. Each model was trained using both single-modality embeddings (clinical, pathology reports, molecular, WSI, and radiology images, where available) and three types of multimodal embeddings: concatenated, mean-pooled, and Kronecker product features. Predictive performance was measured using the concordance index (C-index), which quantifies the agreement between predicted risk scores and observed survival times.
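
A simplified sketch of this cross-validation protocol for the Cox model is shown below, using lifelines. Stratification here uses the event indicator only (the full protocol also balances follow-up times), and the ridge penalty and fold settings are illustrative.

```python
# Sketch: 5-fold cross-validated Cox model on embedding features (lifelines).
# Simplification: folds are stratified on the event indicator only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
from sklearn.model_selection import StratifiedKFold

def cv_cindex(X: np.ndarray, time: np.ndarray, event: np.ndarray, seed: int = 0):
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
    df["duration"], df["event"] = time, event
    scores = []
    for train_idx, test_idx in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, event):
        cph = CoxPHFitter(penalizer=0.1)  # ridge penalty helps with high-dimensional embeddings
        cph.fit(df.iloc[train_idx], duration_col="duration", event_col="event")
        test = df.iloc[test_idx]
        risk = cph.predict_partial_hazard(test)  # higher risk -> shorter expected survival
        scores.append(concordance_index(test["duration"], -risk, test["event"]))
    return float(np.mean(scores)), float(np.std(scores))
```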

Table 2 presents a comparison of HONeYBEE embeddings against established baseline methods for survival prediction in two representative cancer types: breast cancer (TCGA-BRCA) and bladder cancer (TCGA-BLCA). Baseline models, including SurvPath29, ABMIL30, MCAT31, MOTCat32, Porpoise33, and PathOmic34, achieved concordance indices (C-indices) ranging from 0.566 to 0.655, representing the current state-of-the-art in multimodal survival prediction. In contrast, HONeYBEE embeddings demonstrated substantially superior performance. Clinical embeddings achieved the highest single-modality results using Cox models, outperforming all baseline methods and reflecting the rich prognostic information captured from structured clinical data. This likely reflects the curated nature of TCGA clinical documentation, where key diagnostic variables, such as cancer subtype, stage, grade, and molecular markers, are systematically extracted and recorded by experts, effectively summarizing prognostic signals that may remain distributed across raw imaging, pathology, and molecular data. While multimodal fusion via concatenation produced strong results, it did not surpass clinical features alone within this curated dataset.

Table 2 Comparison of survival prediction methods on TCGA-BRCA and TCGA-BLCA datasets

We consider that multimodal fusion holds critical importance in real-world oncology practice, where clinical documentation may be incomplete, inconsistently structured, or lacking detailed annotations. In such settings, integrating complementary signals from radiology, pathology, and molecular data can compensate for missing or unreliable clinical narratives. While TCGA highlights the dominant role of clinical features in expert-curated research datasets, our findings emphasize the potential of multimodal fusion to develop more robust and generalizable survival prediction models applicable to real-world, less-curated healthcare environments.

Extending the analysis to all 33 TCGA cancer types revealed considerable heterogeneity in survival prediction performance across cancers (Supplementary Tables 1–6 and Supplementary Note—Overall survival). Clinical embeddings achieved C-indices above 0.80 for 27 of 33 cancer types, reaffirming their dominant prognostic role within TCGA. Notable examples included TCGA-PCPG (0.996 ± 0.007), TCGA-THCA (0.985 ± 0.003), and TCGA-UCEC (0.935 ± 0.012). However, certain malignancies were better characterized by other modalities; for instance, molecular embeddings achieved superior performance in TCGA-KICH (0.725 ± 0.173), suggesting a stronger molecular basis for prognosis in this cancer type. Pathology report and WSI embeddings contributed moderate predictive value across most cancers, while radiology embeddings were limited by sample availability, being present for only four cancer types.

Importantly, multimodal fusion demonstrated cancer-type-specific benefits. While concatenation generally yielded the best multimodal performance, it did not consistently outperform clinical embeddings alone. However, for specific cancers, multimodal integration enhanced prognostic accuracy beyond what clinical features captured alone. For example, in TCGA-UCS, multimodal fusion improved C-index from 0.794 ± 0.101 (clinical) to 0.836 ± 0.070; in TCGA-UVM, from 0.844 ± 0.042 to 0.860 ± 0.084; and in TCGA-THYM, from 0.978 ± 0.032 to 0.983 ± 0.033. These improvements suggest that, for certain cancer types, integrating complementary information from multiple modalities enables more accurate survival predictions than relying solely on clinical documentation.

To assess the risk stratification performance of our survival models, we examined Kaplan–Meier survival curves based on model-predicted risk groups (Supplementary Figs. 7–9), which illustrate the performance of the Cox proportional hazards, RSF, and DeepSurv models, respectively, for bladder cancer (TCGA-BLCA). These visualizations demonstrate how clinical embeddings consistently achieve superior patient stratification compared to other modalities, with clear separation between low-, medium-, and high-risk groups (log-rank p < 0.001 for clinical features across all models).

HONeYBEE embeddings were evaluated for cancer type classification using Random Forest models trained on the full TCGA dataset with all five data modalities. We conducted 10 classification experiments with different random seeds, using an 80/20 train-test split with stratification to maintain class balance. Each Random Forest model comprised 100 estimators.
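
A minimal sketch of this classification protocol is shown below; apart from the 100-tree Random Forest, the 80/20 stratified split, and the 10 seeded repetitions stated above, all other settings are library defaults.

```python
# Sketch of the classification protocol: 10 seeded runs of a 100-tree Random
# Forest on patient-level embeddings X with cancer-type labels y.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def classification_accuracy(X, y, n_runs: int = 10):
    accs = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        clf = RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=-1)
        clf.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
    return float(np.mean(accs)), float(np.std(accs))
```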

Table 3 summarizes the cancer type classification performance across individual modalities and multimodal fusion strategies. Performance varied substantially across modalities, reflecting differences in the diagnostic information available within each data type. Clinical embeddings achieved the highest performance among single modalities, underscoring the rich diagnostic content captured in clinical narratives. Pathology reports also demonstrated strong discriminative power, while molecular profiles provided moderate performance. In contrast, the imaging modalities (radiology and WSI) performed considerably lower, likely due to greater intra-class variability and the indirect relationship between imaging features and cancer type.

Table 3 Cancer type classification performance across modalities and fusion methods

Notably, the multimodal fusion methods outperformed the best individual modality, with concatenation achieving the highest overall accuracy, a marginal but consistent improvement. The Kronecker product fusion achieved comparable performance, suggesting that modeling cross-modal interactions provides additional discriminative signal. Mean pooling also performed well, though below the other fusion methods, indicating that while some modality-specific information is lost through averaging, the method still captures substantial shared signal across modalities. The small standard deviations across runs (<1% for all methods) indicate that the embedding representations are highly stable.

We evaluated the semantic quality of HONeYBEE embeddings through similarity-based retrieval using the open-source Facebook AI Similarity Search (FAISS) library, which enables efficient nearest-neighbor search across dense vector spaces35. For each modality, we calculated retrieval precision@k, the proportion of patients with the same cancer type among the k nearest neighbors. Higher precision@k values indicate that embeddings effectively capture phenotypic similarity relevant to cohort matching and cancer-type discrimination.
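
A minimal sketch of the precision@k computation with FAISS is shown below; the flat inner-product index and L2 normalization are assumptions, as the paper does not specify the exact index configuration.

```python
# Sketch: precision@k retrieval with FAISS over L2-normalized embeddings.
import faiss
import numpy as np

def precision_at_k(X: np.ndarray, y, k: int = 10) -> float:
    y = np.asarray(y)
    X = np.ascontiguousarray(X.astype("float32"))
    faiss.normalize_L2(X)                        # cosine similarity via inner product
    index = faiss.IndexFlatIP(X.shape[1])
    index.add(X)
    _, neighbors = index.search(X, k + 1)        # +1 because each query retrieves itself
    hits = (y[neighbors[:, 1:]] == y[:, None])   # drop the self-match in column 0
    return float(hits.mean())
```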

Table 4 summarizes the retrieval performance across all modalities and fusion methods. Clinical embeddings demonstrated superior retrieval capability, substantially outperforming other individual modalities. In contrast, WSI embeddings showed the poorest performance, likely due to their slide-level rather than patient-level representation. All multimodal fusion methods underperformed relative to clinical embeddings alone, with concatenation achieving the best fusion performance.

Table 4 Retrieval performance metrics across modalities and fusion methods

These results suggest that integrating weaker modalities diluted the strong semantic signal present in clinical embeddings, reducing retrieval precision despite modest improvements in clustering metrics (e.g., AMI). This highlights an important distinction between classification and retrieval tasks: while multimodal integration enhances overall class separation, clinical embeddings alone provide the most semantically meaningful representations for similarity-based patient retrieval.

The failure analysis (Table 5) revealed specific patterns of confusion between cancer types. For clinical embeddings, the most commonly confused pairs included READ retrieved as COAD (562 cases), LUAD as LUSC (480 cases), and LUSC as LUAD (401 cases), reflecting known biological similarities between these cancer types. WSI embeddings showed extreme confusion rates, with 85.7% of queries failing to retrieve same-type patients in the top 10 results.

Table 5 Top confusion patterns in retrieval failures for each modality

Comparative evaluation of language models for text embeddings

We conducted a systematic evaluation of four LLMs for generating clinical text embeddings: GatorTron (medical encoder model)17, Qwen3 (general-purpose encoder)24, Med-Gemma (medical decoder model)25, and LLaMA (general-purpose decoder)26. These models were evaluated using clinical notes and pathology reports from the TCGA dataset across three key tasks: cancer type classification, patient retrieval, and overall survival prediction. Figure 2 presents t-SNE visualizations of the embedding spaces produced by each model, illustrating distinct differences in how various architectures and training strategies encode cancer-specific information. The degree of cluster separation and organization varied markedly between models, with direct implications for downstream task performance.

Fig. 2: t-SNE visualization of clinical text embeddings generated by four language models, colored by TCGA cancer type.

A GatorTron embeddings (AMI: 0.71, NMI: 0.75). B Qwen3 embeddings (AMI: 0.79, NMI: 0.82). C Med-Gemma embeddings (AMI: 0.59, NMI: 0.65), and D Llama-3.2 embeddings (AMI: 0.60, NMI: 0.66). Each point represents a patient, with colors and symbols indicating one of 33 cancer types from TCGA. The visualizations demonstrate varying degrees of cancer-type clustering across different language models, with Qwen3 achieving the highest clustering quality (AMI: 0.79) followed by GatorTron (AMI: 0.71), while Med-Gemma and Llama-3.2 show moderate clustering performance, reflecting each model’s ability to capture cancer-specific information from clinical data.
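
Projections of this kind can be produced with a standard t-SNE workflow; the minimal sketch below is illustrative, and the perplexity, initialization, and plotting details are assumptions rather than the exact settings used for Fig. 2 (which also varies marker symbols by cancer type).

```python
# Sketch: 2-D t-SNE projection of text embeddings, colored by cancer type.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(X: np.ndarray, y: np.ndarray, seed: int = 0):
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=seed).fit_transform(X)
    for label in np.unique(y):
        m = y == label
        plt.scatter(coords[m, 0], coords[m, 1], s=3)  # one color per cancer type
    plt.axis("off")
    plt.show()
```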

Table 6 presents the classification accuracy across all four models for both text modalities. Clinical text embeddings performed strongly, with accuracies exceeding 98%. Notably, the general-purpose encoder Qwen3 achieved the highest accuracy (99.95%), suggesting that broad language understanding capabilities translate effectively to clinical text representation. Cancer type classification using pathology reports proved more challenging, with accuracies ranging from 78.41% to 84.90%, with the decoder-only models performing slightly better.

Table 6 Cancer type classification performance (mean accuracy ± std) across 10 runs of Random Forest classifier for 33 TCGA cancer types using embeddings from clinical data and pathology reports

We evaluated the quality of embeddings through retrieval tasks, measuring precision@k for identifying patients with the same cancer type. As shown in Table 7, clinical embeddings demonstrated superior retrieval performance, with Qwen3 achieving near-perfect precision@10 (99.57%). The substantial performance gap between clinical and pathology report embeddings highlights the structured nature of clinical narratives compared to the more heterogeneous pathology reports.

Table 7 Retrieval performance metrics for patient similarity search across 33 TCGA cancer types using embeddings from four language models applied to clinical data and pathology reports

To assess the clinical utility of embeddings, we conducted stratified survival analysis using Cox proportional hazards, random survival forests, and DeepSurv models. Table 8 summarizes the average concordance indices across 33 cancer types. Clinical embeddings consistently achieved C-indices above 0.84, demonstrating strong prognostic value. In contrast, pathology report embeddings showed limited predictive power (C-index 0.58), suggesting that current language models may not fully capture the prognostic information embedded in pathological descriptions.

Table 8 Average concordance indices for overall survival prediction across 33 TCGA cancer types using embeddings from four language models applied to clinical data and pathology reports

Given the lower baseline performance on pathology reports, we investigated whether task-specific fine-tuning could improve embedding quality. We trained lightweight neural network classifiers on top of frozen embeddings and compared their performance to the baseline random forest approach. As shown in Table 9, fine-tuning yielded substantial improvements across all models, with accuracies increasing by 7.8–12.7 percentage points.

Table 9 Cancer type classification accuracy improvement for 33 TCGA cancer types using pathology report embeddings
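
A minimal sketch of such a lightweight classification head trained on frozen embeddings is shown below; the layer sizes, optimizer, and number of epochs are illustrative assumptions.

```python
# Sketch: a small classification head on frozen pathology-report embeddings.
# Architecture and hyperparameters are illustrative, not the exact setup used.
import torch
import torch.nn as nn

def train_head(X: torch.Tensor, y: torch.Tensor, n_classes: int = 33, epochs: int = 20):
    head = nn.Sequential(
        nn.Linear(X.shape[1], 256), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(256, n_classes),
    )
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X.float(), y.long()),
        batch_size=256, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(head(xb), yb).backward()  # embeddings stay frozen; only the head is trained
            opt.step()
    return head
```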

Figure 3 shows a visual representation of the embedding space using t-SNE projections before and after fine-tuning the LLMs on task-specific data. The fine-tuned embeddings demonstrate markedly improved cluster separation. This improvement may translate to better downstream task performance while maintaining computational efficiency through parameter-efficient fine-tuning approaches.

Fig. 3: t-SNE visualization of embeddings generated using four language models for the pathology reports.

Each point represents a patient colored by TCGA cancer type. Fine-tuned models show improved cluster separation compared to pre-trained models, with AMI scores increasing for all models: GatorTron from 0.35 to 0.91, Qwen3 from 0.39 to 0.93, Med-Gemma from 0.26 to 0.93, and Llama from 0.32 to 0.93. These results demonstrate improved cancer-type discrimination after LLM fine-tuning.

Discussion

In this study, we introduced HONeYBEE, a modular, open-source framework designed to generate unified, patient-level embeddings from diverse biomedical data modalities, including clinical text, pathology reports, radiology images, WSIs, and molecular profiles. By leveraging domain-specific FMs and supporting flexible multimodal fusion strategies, HONeYBEE enables comprehensive representation learning suitable for a wide range of oncology tasks. Using multimodal data from over 11,400 patients spanning 33 cancer types in TCGA, we demonstrated that HONeYBEE-generated embeddings are robust, generalizable, and consistently effective across classification, OS prediction, and patient retrieval tasks. These findings establish HONeYBEE as a scalable and extensible platform for multimodal oncology research. The framework and resulting embeddings are publicly available.

Beyond its empirical performance, HONeYBEE offers technical advantages through its unified architecture. Existing multimodal AI development often relies on fragmented pipelines requiring manual coordination of modality-specific tools such as OHDSI for structured clinical data, PyDicom for radiology imaging, OpenSlide for pathology images, and TIAToolbox for histopathology processing36,37,38,39. This fragmentation introduces significant technical overhead when harmonizing cross-modal metadata and managing inconsistent APIs. HONeYBEE resolves these challenges through a modular, plug-and-play framework that standardizes data ingestion, embedding generation, and fusion via a unified API and encoder architecture8. Leveraging pretrained FMs from the Hugging Face ecosystem1, HONeYBEE supports scalable and efficient embedding generation across diverse modalities, while accommodating missing data without requiring complete-case cohorts.

Our comprehensive evaluation revealed that clinical embeddings consistently dominated across clustering, classification, retrieval, and survival prediction tasks. This finding likely reflects the curated nature of TCGA clinical documentation, where key diagnostic variables, such as cancer subtype, stage, grade, and molecular markers, are systematically extracted by clinical experts and recorded as structured narratives. These expert-curated summaries encapsulate prognostically relevant information that remains distributed across raw imaging, pathology, and molecular data, explaining why clinical embeddings achieved the strongest cancer-type separability (NMI: 0.7448, AMI: 0.702), highest classification accuracy (98.5%), and top retrieval precision@10 (96.4%). Clinical embeddings also consistently achieved the highest C-indices in survival prediction across 27 of 33 cancer types. These results underscore the critical role of structured clinical documentation in oncology AI workflows, at least in curated datasets like TCGA.

While clinical embeddings outperformed other modalities in most tasks, multimodal integration demonstrated significant value in specific contexts. Particularly in OS prediction, multimodal fusion using concatenation improved prognostic performance for certain cancer types, such as TCGA-UCS, TCGA-UVM, and TCGA-THYM, where complementary information from molecular, pathology, and imaging modalities contributed to enhanced predictions.

More broadly, fusion strategies like concatenation consistently outperformed weaker individual modalities (e.g., radiology, WSI) across clustering and classification tasks. These findings suggest that while clinical data dominate in curated datasets, multimodal fusion holds critical importance for developing robust models in real-world settings where clinical documentation may be incomplete, inconsistent, or less structured.

In our comparative evaluation of four language models (GatorTron, Qwen3, Med-Gemma, and LLaMA-3.2), we found that general-purpose models like Qwen3 achieved superior clustering (AMI: 0.79, NMI: 0.82), classification accuracy (99.95%), and retrieval precision compared to domain-specific models like GatorTron and Med-Gemma when applied to clinical text. This suggests that large-scale general-purpose models may encode broader semantic structures that transfer effectively to clinical tasks, especially when applied without further fine-tuning. However, domain-specific models like GatorTron still demonstrated strong and consistent performance, particularly in OS prediction tasks, reinforcing the relevance of clinical pretraining for capturing prognostic signals. These findings indicate that while general-purpose models can outperform specialized architectures in certain tasks, optimal model selection depends on task-specific objectives and data characteristics.

Additionally, fine-tuning significantly enhanced the performance of language models on pathology reports, a modality where baseline embeddings performed suboptimally. Across all models, fine-tuned embeddings improved cancer type classification accuracy by 7.8–12.7 percentage points and increased AMI scores from approximately 0.35–0.39 to 0.91–0.93. This substantial gain underscores the importance of task-specific adaptation when processing heterogeneous and less-structured modalities like pathology narratives. These results highlight that parameter-efficient fine-tuning can effectively bridge the performance gap between raw and curated text, suggesting that real-world deployment of clinical AI systems should incorporate fine-tuning pipelines to optimize embeddings for downstream oncology tasks.

In summary, HONeYBEE establishes a versatile, open-source foundation for multimodal representation learning in oncology, validated across diverse tasks and data modalities. Our findings emphasize the dominant role of clinical text in structured datasets like TCGA, while highlighting the potential of multimodal fusion and targeted fine-tuning to improve performance in real-world, less-curated environments. Future work will focus on extending HONeYBEE to prospective clinical data, evaluating embedding utility for treatment response prediction, and integrating additional modalities such as genomics, radiomics, and wearable sensor data. By facilitating modular, task-agnostic embedding generation, HONeYBEE offers a scalable platform for advancing precision oncology research and clinical applications.

Furthermore, HONeYBEE’s unified patient-level embeddings enable practical applications across oncology research and clinical workflows. These include biomarker discovery, cross-modal hypothesis generation, patient similarity search, cohort identification, and clinical decision support1,40,41,42,43. Generated embeddings are compact and interoperable with vector database systems such as FAISS and Annoy35,44, supporting deployment in retrieval-augmented generation, risk stratification, and clinical triage pipelines. Critically, since embeddings can be generated locally without sharing raw data, HONeYBEE facilitates federated learning collaborations45, promoting secure, multi-institutional model development across diverse oncology datasets.

Methods

HONeYBEE is a modular, open-source framework designed to enable scalable, reproducible, and standardized preprocessing of multimodal oncology data for downstream AI application development, machine learning, and representation learning tasks. The framework provides dedicated processing pipelines for four primary biomedical data modalities: whole slide histopathology images, radiology imaging, molecular profiles, and clinical text. Each modality-specific pipeline incorporates state-of-the-art preprocessing techniques tailored to the data type and is designed to generate high-quality, fixed-length data representations (embeddings) using domain-specific FMs. These embeddings are stored in a structured and accessible format to support classification, retrieval, survival analysis, and other downstream applications.

The HONeYBEE architecture supports the integration of both public and institutional datasets and emphasizes interoperability through standardized APIs and metadata handling. Pretrained and custom models are supported via Hugging Face; any open-source FM whose weights are available on Hugging Face can seamlessly replace the default model for a given modality. Outputs are organized for compatibility with modern AI pipelines. Figure 1 provides an overview of the framework’s architecture, highlighting the sequence of data ingestion, preprocessing, embedding generation, and downstream application.

Pathology image processing

Whole slide images (WSIs) present unique computational challenges due to their extreme size, multi-resolution pyramid structure, and vendor-specific file formats. HONeYBEE addresses these challenges through an integrated processing pipeline specifically designed for computational pathology workflows. Our implementation leverages GPU acceleration to efficiently handle WSI loading, preprocessing, tissue detection, normalization, and feature extraction, as illustrated in Fig. 4A.

Fig. 4: Pathology and radiology imaging data processing pipelines implemented in HONeYBEE.

A Whole slide image processing: (i) Tissue samples are digitized into gigapixel WSIs using slide scanners. (ii) Stain normalization is applied to reduce inter-slide variability caused by differences in hematoxylin, eosin, or DAB staining. Methods include Reinhard, Vahadane, and Macenko normalization techniques. (iii) The source WSI undergoes preprocessing, including tissue segmentation and filtering. (iv) Tissue regions are divided into smaller, information-rich patches for analysis. (v) A pretrained tissue detector model classifies regions as slide background, tissue, or noise to identify high-quality tissue areas. (vi) Grid-based patch extraction is performed over valid tissue regions. (vii) A pretrained embedding extraction model processes the tissue patches to generate fixed-length vectors. These embeddings, along with metadata, are stored in a structured database for downstream AI applications. B Radiological image processing: (i) Radiology images are acquired and ingested in DICOM or NIfTI format. (ii) Spatial standardization and resampling are applied to harmonize voxel spacing and orientation. (iii) Denoising and artifact reduction methods, such as non-local means or deep learning-based techniques, improve signal quality. (iv) Segmentation models isolate anatomical structures or lesions. (v) Intensity normalization ensures consistency across patients and scanners. (vi) Preprocessed images are passed through an embedding extraction model, and the resulting feature vectors are stored with metadata. This standardized pipeline enables downstream analysis such as classification, retrieval, and prognosis modeling across both pathology and radiology imaging modalities.

HONeYBEE utilizes CuImage46 for efficient loading and handling of WSIs, supporting multiple vendor-specific formats, including Aperio SVS, Philips TIFF, and generic tiled multi-resolution RGB TIFF files. CuImage provides GPU-accelerated I/O capabilities through CUDA46, enabling fast loading of specific regions of interest while managing memory constraints inherent in working with gigapixel-scale images. The library supports various compression schemes, including JPEG, JPEG2000, LZW, and Deflate, providing flexibility in handling different WSI formats while maintaining processing speed.

The framework implements metadata handling to preserve critical image properties and structure. This includes maintaining physical spacing information in micrometers, coordinate system specifications, and vendor-specific metadata such as objective magnification and microns-per-pixel ratios. For formats like Aperio SVS, detailed scan parameters and ICC color profiles are preserved, allowing for accurate spatial measurements and color reproduction in downstream analysis tasks.

HONeYBEE’s multi-resolution pyramid management allows efficient access to WSI data at different magnification levels. The system maintains resolution information, including dimensions, downsampling factors, and tile specifications for each level. This enables memory-efficient implementation of various analysis strategies, from rapid whole-slide thumbnail generation to high-resolution region analysis. The framework implements on-demand loading of specific regions at desired resolution levels, supporting both CPU and GPU memory targets through device specification. When available, the system can leverage NVIDIA GPUDirect Storage for direct data transfer from storage to GPU memory, further reducing I/O bottlenecks47.

Memory efficiency is achieved through intelligent region reading strategies that maintain coordinate consistency across resolution levels. Users can specify locations in base-resolution coordinates while accessing data at any pyramid level, simplifying the development of multi-scale analysis pipelines. The implementation includes automatic memory management through caching strategies, optimizing performance while preventing memory overflow when working with multiple large WSIs simultaneously.
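
A minimal sketch of this region-reading pattern with cuCIM’s CuImage is shown below; the file path, coordinates, and pyramid level are illustrative, and the location follows the base-resolution convention described above.

```python
# Sketch: multi-resolution region reading with cuCIM's CuImage.
# The slide path and coordinates are illustrative placeholders.
import numpy as np
from cucim import CuImage

slide = CuImage("example_slide.svs")      # hypothetical Aperio SVS file
print(slide.resolutions)                  # pyramid level count, dimensions, downsamples

# Read a 1024 x 1024 region at pyramid level 1, addressed in level-0 coordinates.
region = slide.read_region(location=(20000, 15000), size=(1024, 1024), level=1)
patch = np.asarray(region)                # H x W x 3 uint8 array for downstream models
```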

Effective tissue detection and patch extraction are essential preprocessing steps for computational pathology analysis. As shown in Fig. 4, HONeYBEE implements two distinct approaches for tissue detection: a classical threshold-based method using Otsu’s algorithm48 and a deep learning-based approach utilizing a pretrained DenseNet Slidl model49.

The Otsu-based method operates on the assumption that tissue regions exhibit distinct intensity characteristics compared to the background. After converting the image to grayscale, the algorithm automatically determines an optimal threshold that maximizes the variance between tissue and background classes. As shown in Fig. 4, this approach generates a gradient magnitude map and binary tissue mask, followed by grid-based patch extraction. While computationally efficient, this method works best with images that have good contrast between tissue and background regions.

The deep learning-based approach employs a pretrained DenseNet model that classifies image patches into three categories: tissue, background, and noise. This model, originally trained by Slidl49, provides more robust tissue detection and can identify artifacts such as pen markings that should be excluded from analysis. Figure 4 demonstrates the model’s capability to distinguish between actual tissue regions, background, and artifacts like pen markings, enabling more precise patch extraction. However, even this sophisticated approach can face challenges with certain types of slides.

Following tissue detection, HONeYBEE implements an intelligent patch extraction strategy that maximizes the coverage of relevant tissue regions while minimizing computational overhead. The framework divides detected tissue regions into standardized patches with configurable size parameters (typically 256 × 256 or 512 × 512 pixels). Patches are filtered based on tissue content thresholds to ensure meaningful analysis, and their coordinates are maintained relative to the original WSI for spatial reference. The extracted patches can be efficiently loaded into memory as needed, enabling scalable processing of large whole slide images while maintaining spatial context for downstream analysis tasks.
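
A minimal sketch of the Otsu-based tissue masking and grid patch selection described above is shown below; the patch size and tissue-content threshold are illustrative assumptions.

```python
# Sketch: Otsu tissue mask followed by grid patch selection filtered by tissue
# content. Patch size and the 50% tissue threshold are illustrative.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def tissue_patch_grid(image: np.ndarray, patch: int = 256, min_tissue: float = 0.5):
    gray = rgb2gray(image)
    mask = gray < threshold_otsu(gray)      # tissue is darker than the bright background
    coords = []
    for y in range(0, mask.shape[0] - patch + 1, patch):
        for x in range(0, mask.shape[1] - patch + 1, patch):
            if mask[y:y + patch, x:x + patch].mean() >= min_tissue:
                coords.append((x, y))       # top-left corner of each retained patch
    return coords
```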

For whole-slide images, patch-level embeddings from the UNI model are aggregated to patient-level representations using mean pooling across all tissue-containing patches. While this approach may not capture the full complexity of histopathological patterns compared to specialized Multiple Instance Learning methods, it provides a standardized baseline representation compatible with HONeYBEE’s unified embedding framework.

Histopathological image appearance varies significantly due to differences in tissue preparation, scanner specifications, and laboratory protocols. These variations negatively impact both visual interpretation and computational analysis. HONeYBEE addresses this challenge by implementing three state-of-the-art stain normalization methods, as illustrated in Fig. 4A:

  • Reinhard normalization: This technique performs color normalization by matching statistical properties of the source image to a target image in LAB color space. While computationally efficient, it does not explicitly account for specific hematoxylin and eosin stain characteristics50.

  • Macenko normalization: This method estimates stain vectors through singular value decomposition in optical density space, enabling separation of hematoxylin and eosin contributions before normalization. This approach better preserves relative staining patterns while standardizing overall appearance51.

  • Vahadane normalization: Using sparse non-negative matrix factorization, this technique decomposes images into stain density maps, offering robust stain separation even with varying tissue characteristics or additional stain presence52.

All three normalization methods in HONeYBEE are GPU-accelerated, providing significant performance improvements over CPU-based alternatives (average 8–12 × speedup on typical hardware configurations). Our implementation maintains consistency in normalization results across different image scales through automated parameter adjustment and handles memory management for large WSIs through efficient tiling strategies. A comprehensive evaluation of how these normalization methods impact the performance of different foundation models is presented in the Supplementary Materials (see Supplementary Note—Staining impact of performance, Supplementary Figs. 10 and 11, and Supplementary Tables 7 and 8), where we demonstrate that the benefit of stain normalization varies significantly based on model architecture and training strategy.
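
As a concrete illustration of the statistics-matching idea behind Reinhard normalization, the following minimal CPU sketch matches per-channel mean and standard deviation in LAB color space; it is a simplified reference, not the framework’s GPU-accelerated implementation.

```python
# Sketch: Reinhard stain normalization by matching LAB-channel statistics of a
# source tile to a target (reference) tile.
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    src, tgt = rgb2lab(source), rgb2lab(target)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    tgt_mean, tgt_std = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    matched = (src - src_mean) / (src_std + 1e-8) * tgt_std + tgt_mean
    return np.clip(lab2rgb(matched), 0, 1)   # float RGB in [0, 1]
```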

Beyond normalization, quantitative tissue analysis often requires the isolation of individual stain components. HONeYBEE implements color deconvolution methods to decompose H&E-stained images into constituent components based on the Beer–Lambert law of light absorption, as shown in Fig. 4A.

Our implementation converts RGB images to hematoxylin–eosin–DAB (HED) color space through calibrated deconvolution matrices that account for the characteristic absorption spectra of each stain. This transformation enables separate analysis of (i) the hematoxylin component, highlighting nuclear structures; (ii) the eosin component, emphasizing cytoplasmic and stromal features; and (iii) the DAB component, isolating immunohistochemical signals when present.

The framework provides bidirectional conversion between RGB and HED color spaces, allowing researchers to analyze separated stain channels and reconstruct normalized images with modified stain characteristics. These operations are GPU-accelerated and process large image regions through automated tiling strategies to maintain memory efficiency.
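
A minimal sketch of this stain separation using the standard skimage deconvolution routines is shown below; the framework’s GPU-accelerated path follows the same HED transformation.

```python
# Sketch: separating hematoxylin, eosin, and DAB channels via HED color
# deconvolution and reconstructing single-stain RGB images.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def separate_stains(rgb: np.ndarray):
    hed = rgb2hed(rgb)                                   # H, E, DAB density maps
    null = np.zeros_like(hed[..., 0])
    h_only = hed2rgb(np.stack((hed[..., 0], null, null), axis=-1))
    e_only = hed2rgb(np.stack((null, hed[..., 1], null), axis=-1))
    d_only = hed2rgb(np.stack((null, null, hed[..., 2]), axis=-1))
    return h_only, e_only, d_only
```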

Radiology imaging processing

Medical imaging plays a central role in oncology for diagnosis, staging, treatment planning, and response assessment. HONeYBEE implements comprehensive processing pipelines for major radiological modalities, including computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET). Each modality presents unique characteristics that require specialized preprocessing techniques while maintaining standardized outputs for downstream multimodal integration.

HONeYBEE provides support for standard medical imaging formats through integration with industry-standard digital imaging and communications in medicine (DICOM) protocols and specialized neuroimaging formats such as neuroimaging informatics technology initiative (NIfTI)53,54. Our implementation preserves metadata that influences image interpretation and quantitative analysis. For CT imaging, this includes Hounsfield unit (HU) calibration parameters, exposure settings (kVp, mAs, dose), slice thickness, reconstruction kernel information, and scanner-specific calibration factors. MRI sequence metadata preservation encompasses pulse sequence parameters (TR/TE values), field strength, coil configuration, contrast administration details, and acquisition plane and timing information. PET studies metadata retention includes radiopharmaceutical type and injection parameters, standardized uptake value (SUV) conversion factors, attenuation correction methods, and decay correction timing information55.

The framework implements memory-efficient loading strategies through lazy evaluation, allowing researchers to work with large volumetric datasets without excessive memory requirements56. This is particularly important for multi-sequence MRI studies or time-series imaging data that can exceed several gigabytes per patient. The implementation supports both single-file and multi-file DICOM series, automatically handling series organization, temporal sequencing, and multi-frame images57,58.
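
As an illustration of the metadata-aware loading described above, the sketch below reads a CT DICOM series with pydicom and applies the Hounsfield-unit calibration stored in the headers; the directory layout and file extension are assumptions.

```python
# Sketch: load a CT DICOM series, sort slices along z, and convert raw pixel
# values to Hounsfield units using RescaleSlope / RescaleIntercept metadata.
import numpy as np
import pydicom
from pathlib import Path

def load_ct_series_hu(series_dir: str) -> np.ndarray:
    slices = [pydicom.dcmread(p) for p in Path(series_dir).glob("*.dcm")]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))   # order along z
    volume = np.stack([s.pixel_array.astype(np.float32) for s in slices])
    slope = float(slices[0].RescaleSlope)
    intercept = float(slices[0].RescaleIntercept)
    return volume * slope + intercept                             # Hounsfield units
```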

Isolating relevant anatomical structures and regions of interest forms a critical preprocessing step for targeted analysis of oncological imaging. HONeYBEE incorporates both classical algorithms and deep learning-based approaches for automated segmentation across different imaging modalities59,60. For CT segmentation, the framework offers automated lung segmentation using threshold-based approaches with morphological refinement, multi-organ segmentation through pre-trained U-Net architectures optimized for thoracic, abdominal, and pelvic regions, automated lung nodule detection using 3D convolutional networks with sensitivity exceeding 85% for nodules larger than 4 mm, and bone and soft tissue separation using HU-based thresholding with anatomical constraints61.

MRI segmentation capabilities include brain extraction tools optimized for different MRI sequences (T1, T2, FLAIR), multi-parametric tumor segmentation integrating information across multiple sequences, specialized sequence-specific segmentation models for contrast-enhanced regions, and atlas-based segmentation for standardized anatomical referencing62. For PET-based segmentation, the framework implements metabolic volume delineation using fixed and adaptive SUV thresholding, gradient-based segmentation for heterogeneous uptake regions, joint PET/CT segmentation leveraging complementary information from both modalities, and automated detection of hypermetabolic regions with statistical deviation analysis60.

All segmentation methods are implemented with GPU acceleration when applicable, providing significant performance improvements over CPU-based alternatives63,64. The framework maintains original image coordinates and transformation matrices, ensuring accurate spatial registration between original images and segmentation masks.

Image noise and artifacts can significantly impact both visual interpretation and computational analysis of medical images. HONeYBEE implements multiple denoising strategies optimized for specific imaging modalities and noise characteristics. CT denoising approaches include non-local means filtering with optimized parameters for different dose levels, deep learning-based denoising through convolutional neural networks trained on paired low-dose/high-dose CT images, edge-preserving bilateral filtering with automatic parameter tuning, and structural fidelity metrics to ensure critical diagnostic features remain uncompromised65.

MRI denoising implementations feature Rician noise models specifically designed for magnitude MRI data, non-local means filtering adapted for different sequence types, wavelet-based denoising with soft thresholding, and specialized approaches for diffusion-weighted imaging addressing unique noise characteristics66. PET denoising capabilities encompass sinogram-space denoising for raw PET data when available, image-space denoising with SUV preservation guarantees, temporal filtering for dynamic PET acquisitions, and joint PET/CT denoising leveraging structural information from CT67.

Our quantitative evaluations demonstrate that these denoising methods achieve an average 40–60% reduction in noise levels while maintaining structural similarity indices (SSIM) above 0.92 compared to reference images. Performance metrics for each method are continuously benchmarked across public datasets to ensure optimal parameter selection68.

Variability in spatial resolution and slice thickness across different scanners and protocols necessitates standardization for consistent analysis59,63. HONeYBEE implements comprehensive resampling capabilities with modality-specific optimization. Resolution standardization includes isotropic voxel resampling with configurable target dimensions, support for both downsampling and upsampling with appropriate filtering, preservation of spatial relationships and anatomical proportions, and multiple interpolation methods including nearest neighbor, linear, B-spline, and Lanczos56,69.
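
A minimal sketch of isotropic voxel resampling with SimpleITK is shown below, assuming a 3-D volume and linear interpolation; it illustrates the general approach rather than the framework’s exact implementation.

```python
# Sketch: resample a volume to isotropic voxel spacing with SimpleITK.
# Linear interpolation is used for images; nearest-neighbor would be used for
# label masks to preserve discrete values.
import SimpleITK as sitk

def resample_isotropic(image: sitk.Image, spacing_mm: float = 1.0) -> sitk.Image:
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    new_spacing = (spacing_mm,) * 3
    new_size = [int(round(sz * sp / spacing_mm)) for sz, sp in zip(old_size, old_spacing)]
    return sitk.Resample(
        image, new_size, sitk.Transform(), sitk.sitkLinear,
        image.GetOrigin(), new_spacing, image.GetDirection(), 0.0, image.GetPixelID())
```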

CT-specific resampling techniques focus on Hounsfield unit preservation during interpolation, thin-slice to thick-slice conversion with appropriate averaging, axial to coronal/sagittal reformatting with consistent spatial referencing, and partial volume effect compensation for quantitative analysis70. MRI-specific resampling addresses multi-sequence alignment to establish spatial correspondence, motion correction between sequences acquired at different timepoints, distortion correction for echo-planar imaging sequences, and field inhomogeneity compensation for accurate spatial registration71. PET-specific resampling includes SUV-preserving interpolation methods, PET/CT alignment with motion compensation, resolution recovery for quantitative accuracy, and partial volume effect correction for small lesions72.

Our implementation tracks and propagates spatial transformation matrices through all processing steps, enabling accurate coordinate mapping between original and processed images. This spatial consistency is critical for multimodal integration and region-of-interest analysis73.

Intensity standardization addresses variability in signal intensities across different scanners, protocols, and patient characteristics. HONeYBEE implements modality-specific normalization techniques that preserve quantitative relationships while enabling consistent analysis. CT intensity normalization includes Hounsfield unit verification and calibration, predefined window/level settings for different anatomical regions (lung, soft tissue, bone), adaptive histogram equalization for enhanced contrast in selected regions, and intensity outlier handling for metal artifacts and beam hardening59,74.
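
A minimal sketch of window/level normalization is given below; the preset values are standard clinical defaults and are assumptions rather than HONeYBEE's exact configuration.

```python
import numpy as np

# Common window/level presets in HU; standard clinical defaults used for
# illustration, not necessarily the framework's configured values.
CT_WINDOWS = {
    "lung":        {"level": -600, "width": 1500},
    "soft_tissue": {"level": 40,   "width": 400},
    "bone":        {"level": 400,  "width": 1800},
}

def window_normalize(volume_hu, preset="soft_tissue"):
    """Apply a window/level transform to a CT volume and rescale to [0, 1]."""
    level = CT_WINDOWS[preset]["level"]
    width = CT_WINDOWS[preset]["width"]
    lo, hi = level - width / 2, level + width / 2
    clipped = np.clip(volume_hu, lo, hi)
    return (clipped - lo) / (hi - lo)
```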

MRI intensity normalization encompasses Z-score normalization on a per-sequence basis, histogram matching to standardized references, bias field correction for signal inhomogeneity, tissue-specific intensity normalization using segmentation priors, and cross-sequence intensity harmonization for multi-parametric analysis75,76. PET intensity standardization features SUV calculation with support for different normalization factors (body weight, lean body mass, body surface area), time-decay correction for quantitative accuracy, scanner-specific calibration factor application, and cross-scanner harmonization for multi-center studies77.
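
Two of these normalization steps, per-sequence Z-score normalization for MRI and body-weight SUV calculation for PET, can be sketched as follows; the formulas are the standard definitions, while the parameter names and defaults (e.g., the 18F half-life) are illustrative.

```python
import numpy as np

def zscore_normalize(mri_volume, mask=None):
    """Per-sequence Z-score normalization, optionally restricted to a tissue mask."""
    values = mri_volume[mask] if mask is not None else mri_volume
    return (mri_volume - values.mean()) / (values.std() + 1e-8)

def suv_body_weight(activity_bq_ml, injected_dose_bq, patient_weight_kg,
                    minutes_since_injection, half_life_min=109.77):
    """Body-weight SUV from a PET activity map (Bq/mL).

    Decay-corrects the injected dose to scan time (default half-life is that
    of 18F, ~109.8 min) and normalizes by body weight in grams. This is the
    standard SUVbw definition; parameter names are illustrative.
    """
    decay_corrected_dose = injected_dose_bq * np.exp(
        -np.log(2) * minutes_since_injection / half_life_min
    )
    weight_g = patient_weight_kg * 1000.0
    return activity_bq_ml / (decay_corrected_dose / weight_g)
```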

The framework maintains both raw and normalized versions of the imaging data, allowing researchers to access original quantitative values when needed while providing standardized inputs for machine learning algorithms and AI model development78.

Molecular data processing

Molecular data refers to biological information derived from molecules within cells, including DNA, RNA, proteins, and metabolites. Figure 5A illustrates the molecular data processing pipeline implemented in HONeYBEE, which handles multiple data modalities, including DNA methylation, gene expression, protein expression, DNA mutation, and miRNA expression data. Molecular data is widely used in biomedical research, personalized medicine, and multi-omics analyses to understand disease mechanisms, predict outcomes, and develop targeted therapies. It provides insights into cellular functions, genetic variations, and biochemical processes. Common types of molecular data include genomic data (DNA sequences, mutations, variations), transcriptomic data (RNA expression), proteomic data (protein expression, modifications, interactions), epigenomic data (DNA methylation, histone modifications), and metabolomic data.

Although rich in information, molecular data presents several challenges across multiple stages of analysis, from data acquisition to interpretation. These challenges include varying complexity across modalities, the ‘big P–small N’ problem (very high feature dimensionality with low sample size), limited standardization, difficulties in cross-modality integration, missing data, experimental noise, and high computational and storage demands79. Addressing these challenges requires robust statistical methods, computational advancements, standardized protocols, and interdisciplinary collaboration.

The HONeYBEE framework incorporates all of the preprocessing steps described in the original SeNMo manuscript19 to ensure data integrity while minimizing the complexity of handling heterogeneous molecular data modalities. As described in Section 2.1.3 of SeNMo19, the molecular data processing pipeline covers five data modalities, DNA methylation, gene expression, protein expression, DNA mutation, and miRNA expression, acquired from publicly available repositories such as TCGA and the University of California, Santa Cruz (UCSC) Xena portal. The data undergo a series of preprocessing steps, including normalization, scaling, dropping constant, duplicate, and collinear features, removing low-expression genes, removing NaNs, and handling missing data modalities. This pipeline ensures data harmonization, dimensionality reduction, and the generation of contextually enriched embeddings for downstream machine learning and AI applications. The preprocessed feature sets are then unified into a multimodal feature matrix to facilitate pan-cancer analyses. The key feature-selection steps are the union of features across modalities and cancer types, elimination of redundant (duplicate or collinear) features, integration of the modalities into a single matrix, and addition of clinical covariates; refer to SeNMo19 for details of these steps. For patients with multiple molecular profiles, features are aggregated at the patient level before embedding generation.
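
A simplified sketch of the feature-cleaning steps described above (dropping constant, duplicate, low-expression, and collinear features) is given below for a single modality, assuming a samples-by-features pandas DataFrame; the thresholds are illustrative rather than SeNMo's exact values, and the pairwise-correlation step is shown for a modest feature set rather than a full genome-wide matrix.

```python
import numpy as np
import pandas as pd

def clean_molecular_features(df, corr_threshold=0.95, min_expression=1.0):
    """Illustrative feature cleaning for one molecular modality.

    df: samples-by-features pandas DataFrame of numeric values.
    Drops all-NaN and constant columns, duplicated columns, low-expression
    features, and one member of each highly correlated (collinear) pair.
    """
    # Remove columns that are entirely missing or constant.
    df = df.dropna(axis=1, how="all")
    df = df.loc[:, df.nunique(dropna=True) > 1]

    # Remove exact duplicate columns.
    df = df.loc[:, ~df.T.duplicated()]

    # Remove low-expression features (mean value below a cutoff).
    df = df.loc[:, df.mean() >= min_expression]

    # Remove collinear features: drop one column from each highly
    # correlated pair (upper triangle of the correlation matrix).
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=to_drop)
```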

Fig. 5: Molecular and clinical data processing pipelines implemented in HONeYBEE.
figure 5

A Molecular processing: Multimodal molecular data—such as protein expression, DNA methylation, gene expression, DNA mutation, and miRNA expression—is acquired from public repositories. Preprocessing includes the removal of missing values, constant or duplicate features, low-expression genes, and collinear features. Features are first unified within each modality, then across modalities, and subsequently integrated with clinical data. The combined feature set is passed through a molecular encoder19 to generate embeddings or predictions for downstream tasks. B Clinical processing: Structured and unstructured clinical data, including EHRs, PDFs, and scanned reports, undergo entity recognition, normalization, and embedding. Key stages include: (i) document input, (ii) entity extraction and OCR for scanned text, (iii) tokenization using domain-specific tokenizers, and (iv) embedding generation using Hugging Face-compatible language models. The pipeline supports concept mapping (e.g., ICD, SNOMED CT), accuracy benchmarking, timeline construction, and identification of cancer-specific terms, enabling structured integration of clinical narratives for downstream AI applications.

Clinical data processing

Clinical data in oncology presents unique challenges due to its heterogeneous nature, spanning structured data elements (laboratory values, vital signs, medications) and unstructured narrative text (clinical notes, radiology reports, pathology reports). HONeYBEE implements processing pipelines to handle both data types while maintaining semantic relationships for clinical analysis, as illustrated in Fig. 5B.

The extraction of clinical text from various document formats represents a critical preprocessing step in clinical data analysis. HONeYBEE supports multiple input formats, including PDF, scanned images, and electronic health record (EHR) exports, and implements a multi-stage processing pipeline with modality-specific optimizations. For scanned and digitized documents, the framework applies an optical character recognition (OCR) pipeline based on Tesseract OCR with specialized medical dictionaries to improve recognition accuracy for clinical terminology80. The pipeline includes image preprocessing (deskewing, noise reduction, binarization), layout analysis to preserve document structure, and post-processing with medical terminology verification to identify and correct common OCR errors in clinical contexts81.
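
A minimal OCR sketch using pytesseract and OpenCV is shown below; it covers only basic preprocessing (denoising and Otsu binarization), whereas the full pipeline additionally performs deskewing, layout analysis, and medical-dictionary post-correction.

```python
import cv2
import pytesseract

def ocr_scanned_report(image_path):
    """Minimal OCR sketch for a scanned clinical report.

    Assumes Tesseract and pytesseract are installed. The page-segmentation
    mode (--psm 4, a single column of text of variable sizes) is a
    reasonable default for report-style documents, not a mandated setting.
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    image = cv2.fastNlMeansDenoising(image, h=10)           # noise reduction
    _, binary = cv2.threshold(image, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
    return pytesseract.image_to_string(binary, config="--psm 4")
```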

For structured EHR data in tabular formats, HONeYBEE implements a conversion pipeline that transforms discrete data elements into standardized key–value pair representations. This process preserves the semantic relationships inherent in the original data structure while enabling unified processing alongside unstructured text82. Laboratory results, vital signs, medication orders, and other structured elements are converted into a consistent text format (e.g., “blood_pressure: 120/80 mmHg”, “white_blood_cells: 7.2K/μL”) that can be processed by language models while maintaining clinical meaning and temporal relationships82.
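
The key–value conversion can be illustrated with a short, self-contained sketch; the field names and units follow the examples above and are not a fixed schema.

```python
def ehr_row_to_text(row: dict) -> str:
    """Convert one structured EHR record into the key-value text format
    described above so it can be processed by a language model alongside
    unstructured notes. Field names and units are illustrative.
    """
    parts = []
    for field, value in row.items():
        if value is None or value == "":
            continue  # skip missing elements rather than embedding placeholders
        parts.append(f"{field}: {value}")
    return ", ".join(parts)

# Example:
# ehr_row_to_text({"blood_pressure": "120/80 mmHg", "white_blood_cells": "7.2 K/uL"})
# -> "blood_pressure: 120/80 mmHg, white_blood_cells: 7.2 K/uL"
```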

Document structure analysis is performed to identify and preserve the hierarchical organization of clinical reports, maintaining relationships between sections, subsections, and individual data elements. The framework implements specialized handlers for common clinical document types, including operative reports, pathology reports, and consultation notes, each with specific rules for extracting structured information from semi-structured text80. Quality control measures flag problematic documents and provide confidence scores for extracted text segments, enabling downstream processes to account for potential extraction errors81.

Following successful text extraction, HONeYBEE implements tokenization through integration with the Hugging Face Transformers library, enabling compatibility with state-of-the-art clinical language models. The framework supports multiple pretrained tokenizers optimized for biomedical text, including BioClinicalBERT, PubMedBERT, GatorTron, and ClinicalT5, ensuring appropriate handling of domain-specific vocabulary and clinical narratives17,83. The tokenization pipeline processes extracted clinical text through several stages:

  • Preliminary cleaning to standardize whitespace, handle special characters, and normalize common clinical abbreviations

  • Text segmentation into appropriate units (sentences, paragraphs, sections) based on document structure

  • Subword tokenization using model-specific strategies (WordPiece, Byte-Pair Encoding, SentencePiece)

  • Special token handling for model-specific requirements (e.g., [CLS], [SEP], [MASK])

  • Sequence length management with configurable strategies for truncation and sliding windows

To accommodate long clinical documents that exceed typical model input lengths (generally 512–1024 tokens), HONeYBEE implements multiple strategies: (i) sliding-window tokenization with configurable overlap percentages (typically 10–20%), (ii) hierarchical document segmentation preserving section boundaries, (iii) important-segment identification using clinical term density heuristics, and (iv) document summarization for extremely long texts using extractive methods.
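
The sliding-window strategy can be sketched with the Hugging Face tokenizer API as follows, using a public BioClinicalBERT checkpoint; the overlap fraction follows the 10–20% range noted above, and the function is illustrative rather than HONeYBEE's exact implementation.

```python
from transformers import AutoTokenizer

# Any Hugging Face clinical tokenizer can be substituted here.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def tokenize_long_note(text, max_length=512, overlap_fraction=0.15):
    """Sliding-window tokenization for a long clinical note.

    Produces overlapping chunks so no content is lost at window boundaries.
    """
    stride = int(max_length * overlap_fraction)
    return tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        stride=stride,
        return_overflowing_tokens=True,
        padding="max_length",
        return_tensors="pt",
    )

# The returned "input_ids" tensor has shape (num_windows, max_length); each
# window can be embedded independently and the window embeddings pooled into
# one document-level vector.
```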

The framework’s tokenization implementation is optimized for batch processing, enabling efficient handling of large clinical document collections while maintaining memory efficiency through dynamic batch sizing based on available computational resources84.

Identification and standardization of clinical concepts are essential for structured analysis of medical narratives. HONeYBEE implements a comprehensive clinical entity recognition pipeline utilizing both rule-based and deep learning approaches83,85.

Our implementation includes integration with medical ontologies and terminologies, including SNOMED CT, RxNorm, LOINC, and ICD-O-3, enabling normalization of extracted clinical entities to standardized concept identifiers. This normalization process addresses challenges such as synonymy (multiple terms for the same concept), abbreviation expansion based on context, and disambiguation of terms with multiple potential meanings85.

For cancer-specific entity recognition, the framework implements specialized models for extracting and normalizing86: (i) tumor characteristics (histology, grade, size, invasiveness), (ii) staging information (TNM parameters, stage groupings), (iii) biomarker status (receptor expression, molecular alterations), (iv) treatment details (surgical procedures, chemotherapy regimens, radiation protocols), and (v) response assessment (RECIST criteria, clinical response categories).

Temporal information extraction identifies and normalizes dates, durations, and relative time expressions within clinical narratives. This enables the construction of longitudinal patient timelines with precise sequencing of diagnostic, treatment, and follow-up events critical for oncology research87,88.

The extracted and normalized entities maintain provenance links to their source documents and original text spans, facilitating verification and quality assessment of automated extractions. Performance metrics for entity recognition are continuously benchmarked across public datasets83,85.

Embedding generation and storage

Each preprocessed data sample is passed through a pre-trained FM, which produces a fixed-length embedding vector whose dimensionality depends on the model architecture. HONeYBEE utilizes GPU acceleration and distributed computing, when available, to efficiently generate embeddings for large-scale datasets. The generated embeddings, along with associated metadata, are stored in a structured format to facilitate downstream tasks such as similarity search, clustering, and AI and machine learning model training. HONeYBEE employs efficient data compression and indexing techniques to optimize the storage and retrieval of high-dimensional embeddings. As an example, processed molecular data are passed through the SeNMo model, a self-normalizing deep learning encoder, to generate latent feature vectors (embeddings). These embeddings capture critical biological patterns and are suitable for tasks such as survival analysis, cancer subtype classification, and biomarker discovery.
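
As an illustration of this embedding step for the clinical-text modality, the following sketch mean-pools token representations from a pretrained clinical language model into one fixed-length vector per document; the checkpoint and pooling strategy are assumptions for illustration, not HONeYBEE's mandated configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed_clinical_text(texts, model_name="emilyalsentzer/Bio_ClinicalBERT",
                        device="cuda" if torch.cuda.is_available() else "cpu"):
    """Generate fixed-length embeddings for a batch of clinical texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, tokens, dim)

    # Mask-aware mean pooling over real (non-padding) tokens.
    mask = inputs["attention_mask"].unsqueeze(-1)          # (batch, tokens, 1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    return embeddings.cpu().numpy()                        # (batch, dim)
```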

The generated embeddings and tabular data are stored using the Hugging Face datasets library, which provides a standardized interface for data access and processing89. The datasets are organized into a structured format containing the embeddings, metadata, and labels (if available). PyTorch DataLoaders are employed to efficiently load and iterate over the datasets during model training and evaluation, handling batching, shuffling, and parallel processing. Additionally, HONeYBEE datasets can be integrated into vector databases such as Faiss and Annoy35,44 to enable fast similarity search, nearest-neighbor retrieval, and clustering on the high-dimensional embedding vectors. These databases are optimized for efficient querying and retrieval of embeddings based on similarity metrics such as Euclidean distance or cosine similarity90; a minimal retrieval sketch is shown at the end of this section. By leveraging vector databases, researchers can quickly identify similar samples, perform data exploration, and retrieve relevant subsets based on embedding similarity, facilitating downstream tasks such as retrieval-augmented generation (RAG)91. The structured storage and accessibility components of HONeYBEE ensure that the generated embeddings and associated data are readily available for researchers and practitioners to use in their AI and machine learning pipelines and downstream applications.

This research utilized publicly available data from TCGA, which were previously collected with appropriate ethical approvals and patient consent. No additional human subjects were involved in this study, and therefore no additional ethics approval was required. The TCGA data were obtained through authorized access and used in accordance with the data use agreements.
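
The retrieval sketch referenced above, indexing L2-normalized patient embeddings in Faiss for cosine-similarity search, could look as follows; the flat inner-product index is exact, and approximate Faiss indexes can be substituted for larger cohorts.

```python
import faiss
import numpy as np

def build_patient_index(embeddings: np.ndarray):
    """Index L2-normalized patient embeddings for cosine-similarity retrieval."""
    vectors = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(vectors)               # cosine similarity via inner product
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def most_similar_patients(index, query_embedding, k=5):
    """Return the row indices and cosine similarities of the k nearest patients."""
    query = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)
    return ids[0], scores[0]
```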