Fig. 5: Molecular and clinical data processing pipelines implemented in HONeYBEE. | npj Digital Medicine

From: HONeYBEE: enabling scalable multimodal AI in oncology through foundation model-driven embeddings

A Molecular processing: Multimodal molecular data, such as protein expression, DNA methylation, gene expression, DNA mutation, and miRNA expression, are acquired from public repositories. Preprocessing removes missing values, constant or duplicate features, low-expression genes, and collinear features. Features are first unified within each modality, then across modalities, and subsequently integrated with clinical data. The combined feature set is passed through a molecular encoder (ref. 19) to generate embeddings or predictions for downstream tasks.

B Clinical processing: Structured and unstructured clinical data, including EHRs, PDFs, and scanned reports, undergo entity recognition, normalization, and embedding. Key stages include: (i) document input, (ii) entity extraction and OCR for scanned text, (iii) tokenization using domain-specific tokenizers, and (iv) embedding generation using Hugging Face-compatible language models. The pipeline supports concept mapping (e.g., ICD, SNOMED CT), accuracy benchmarking, timeline construction, and identification of cancer-specific terms, enabling structured integration of clinical narratives for downstream AI applications.
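The preprocessing filters named in panel A can be illustrated with a minimal sketch. This is not HONeYBEE's implementation: the function names `pearson` and `preprocess`, the thresholds `min_mean` and `max_abs_corr`, the toy gene names, and the choice to drop a whole feature when any value is missing are all assumptions made for illustration. The caption does not say how each filter is defined. The sketch uses only the Python standard library; a real pipeline would more likely operate on data frames.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def preprocess(features, min_mean=1.0, max_abs_corr=0.95):
    """Filter a {feature_name: per-sample values} mapping.

    Hypothetical rules (assumed, not taken from the paper):
      - drop a feature if any sample value is missing (None)
      - drop constant features and exact duplicates
      - drop features whose mean is below `min_mean` (low expression)
      - drop a feature if it correlates above `max_abs_corr`
        with a feature that was already kept (collinearity)
    """
    kept = {}
    seen = set()
    for name, values in features.items():
        if any(v is None for v in values):
            continue  # missing values
        if len(set(values)) <= 1:
            continue  # constant feature
        key = tuple(values)
        if key in seen:
            continue  # exact duplicate of a kept feature
        if sum(values) / len(values) < min_mean:
            continue  # low expression
        if any(abs(pearson(values, other)) > max_abs_corr
               for other in kept.values()):
            continue  # collinear with a kept feature
        seen.add(key)
        kept[name] = values
    return kept


# Toy data: one list of values per feature, one entry per sample.
toy = {
    "GENE_A": [5.0, 8.0, 2.0, 9.0],
    "GENE_B": [5.0, 8.0, 2.0, 9.0],     # duplicate of GENE_A
    "GENE_C": [3.0, 3.0, 3.0, 3.0],     # constant
    "GENE_D": [0.1, 0.0, 0.2, 0.1],     # low expression
    "GENE_E": [10.0, 16.0, 4.0, 18.0],  # exactly 2 x GENE_A, so collinear
    "GENE_F": [7.0, None, 1.0, 4.0],    # has a missing value
    "GENE_G": [4.0, 1.0, 6.0, 5.0],
}

kept = preprocess(toy)
print(list(kept))  # ['GENE_A', 'GENE_G']
```

The greedy collinearity filter keeps whichever feature appears first, so the result depends on feature order. That is a simplification of this sketch, not a property stated in the caption.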