Abstract
Breast cancer is a deadly disease with one of the highest mortality rates among all cancers. Advances in biomedical information retrieval techniques have been beneficial in developing early prognosis and diagnosis systems for cancer patients. These systems provide the oncologist with plenty of information from several modalities to devise a correct and feasible treatment plan for breast cancer patients and protect them from unnecessary therapies and their toxic side effects. A cancer patient's related information can be collected using various modalities: clinical, copy number variation, DNA-methylation, microRNA sequencing, gene expression, and histopathological whole slide images. The high dimensionality and heterogeneity of these modalities demand intelligent systems that can identify the features relevant to disease prognosis and diagnosis and make correct predictions. In this work, we study end-to-end systems with two main components: (a) dimensionality reduction techniques applied to the original features from the different modalities and (b) classification techniques applied to the fusion of the reduced feature vectors from the different modalities for the automatic categorization of breast cancer patients as short-term or long-term survivors. Principal component analysis (PCA) and variational autoencoders (VAEs) are used as the dimensionality reduction techniques, followed by support vector machines (SVMs) or random forests as the machine learning classifiers. The study utilizes raw, PCA-projected, and VAE-extracted features of the TCGA-BRCA dataset from six different modalities as input to the machine learning classifiers. We conclude that adding more modalities provides complementary information to the classifiers and increases their stability and robustness. In this study, the multimodal classifiers have not been validated prospectively on primary data.
Introduction
The breast is an essential glandular organ in the female body. It comprises fat, connective, and breast tissues containing lobes, lobules, and milk glands. Rapid and uncontrolled growth of these tissues, more specifically breast cells, causes a prevalent and fatal disease, breast cancer. The growth is not worrisome if it is non-invasive and confined within the milk ducts and lobules, but it becomes deadly in invasive cases, when it spreads to adjacent and non-adjacent tissues or organs. The invasive form is also identified as metastatic breast cancer. According to statistics provided by the Global Cancer Observatory (GCO), breast cancer contributed significantly (11.7%) to the estimated 19,292,789 cancer cases globally for the year 2020. These statistics consider patients of all sexes and ages; restricted to female cancer patients only, the contribution of breast cancer cases rises to 24.5%. The deadly nature of this disease is reflected in the projection of roughly 1 million breast cancer deaths out of 3.2 million breast cancer cases by the year 2040. In light of this grim outlook, we need better prognosis and diagnosis systems that integrate qualitative and quantitative information from histology and genomics, respectively, to predict clinical outcomes. A precise and timely prognosis of cancer patients can help both patients and oncologists choose the appropriate treatment: it avoids the harmful side effects of various cancer-related therapies, prevents overtreatment and thereby lowers economic costs1, allows patients to be included in or excluded from randomized trials more effectively, and supports the development of palliative care and hospice care systems2. Hence, it is essential to know the severity of the disease and the life expectancy of a breast cancer patient at an early stage of treatment. Predicting a patient's survival is not easy, however, owing to the heterogeneous and complex nature of cancer-related data coming from genetic and histopathological sources. The clinical outcomes also vary significantly from patient to patient, which makes the task even more challenging3. The standard survival cut-off for any disease is five years, because this is the time required to label a patient's survivability; earlier research has likewise utilized a 5-year criterion to determine cohort survivability4,5,6,7. Although survival rates have increased dramatically in recent years, individual differences in the 5-year survival rate remain because of breast cancer's complexity. Short-term and long-term survivors are defined as patients whose life expectancies fall below and above the survival cut-off, respectively. The availability of large-scale multi-modal data such as METABRIC8 and The Cancer Genome Atlas (TCGA)9, derived from advanced histopathology and next-generation genome sequencing, supports the development of automated and artificially intelligent machine learning architectures. These architectures can be developed for different tasks in breast cancer diagnosis and prognosis and help oncologists devise precise treatment plans. As medical research has matured, a variety of cutting-edge and distinctive methods for leveraging big data analytic techniques in the development of survival prediction models have been created. Machine learning (ML) is capable of handling dependencies and nonlinear interactions between variables, requires no implicit assumptions, and can learn data models on its own10.
It excels at handling the vast majority of complex higher-order interactions that can be discovered in medical data. As a result, machine learning techniques have enormous potential for use as cutting-edge health informatics tools in routine medical practice.
Related work and motivations
In this section, we review past research on breast cancer prognosis and diagnosis, followed by the limitations that motivated our proposed work. The journey from uni-modal to multi-modal architectures in breast cancer prognosis and diagnosis starts with the usefulness of gene expression profiles derived with the help of microarray technology. The initial research was based on identifying gene signatures or gene markers responsible for the disease. In the task of breast cancer prognosis, Van't Veer et al.11 identified a 70-gene marker set from microarray data of very few primary breast cancer patients. Van de Vijver et al.12 then validated the usefulness of these gene markers on a larger group of breast cancer patients. Such uni-modal supervised classification tasks have been performed with popular machine learning techniques such as the Support Vector Machine (SVM)13 and Random Forest (RF)14. Xu et al.13 trained an SVM-based feature selection model on microarray data to identify a smaller but informative subset of signature genes responsible for breast cancer prognosis prediction. Shifting the research from traditional microarray profiles to histopathological tissue images, Nguyen et al.14 outperformed the previous results with a tissue image-based RF classifier, which has the power of multiple decision trees; this study combines Bayesian probability and RF for feature ranking and classification. Subsequently, exploration of the significance of more than one modality in breast cancer prognosis began, and the integration of microarray profiles with either clinical profiles or histopathological data emerged as bi-modal studies. In terms of integrating gene expression with clinical details, I-RELIEF15 was proposed as a feature selection algorithm to select hybrid signatures with three genes and two clinical markers for breast cancer prognosis; Gevaert et al.16 employed a Bayesian network for the prognosis of lymph-node-negative breast cancer over pathological and gene expression data; and Khademi et al.17 developed a probabilistic graphical model (PGM) that integrated two mutually exclusive models based on microarray and clinical profiles. Not limiting bi-modal studies to gene expression and clinical profiles, ENCAPP18 (an elastic-net-based technique) used a protein-protein interaction network in collaboration with gene expression to estimate the odds of survival in human breast cancer patients. The paradigm shift from uni- to multi-modality can be seen in some recent works on breast cancer prognosis and diagnosis5,6,7. Sun et al. also created the GPMKL19 approach based on an ensemble of multiple kernel functions, which integrates genomic data and pathological images to predict breast cancer prognosis.
Understanding aberrant images and tumor morphology has never been simpler because of advances in medical imaging technologies20,21. Assuming that pathological images may provide extra information about tumor features, several studies have created computational algorithms for forecasting cancer clinical outcomes based on those images. Wang et al.22 established an integrated framework for non-small cell lung cancer computer-aided diagnosis and survival analysis using suggestive markers from histopathology slides. Using pathological image characteristics and gene expression profiles, Zhu et al.23 developed a prediction model for lung cancer survival. Using 2186 Hematoxylin and Eosin pathological whole-slide images (WSIs) of non-small cell lung cancer, Yu et al.24 extracted 9879 significant image features and utilized traditional classification methods to distinguish between short-term and long-term survivors. Motivated by the availability of histopathological images and the need for survival analysis research in the medical field, Tang et al. created the CapSurv capsule network25. The foundation of this approach is a special function termed survival loss, specifically developed for the analysis of cancer patients' survival rates.
A thorough assessment of the literature revealed that gene expression profiles are useful for predicting breast cancer prognosis. However, high-dimensional microarray data and gene correlation pose severe problems in identifying gene markers and predicting prognosis. Any machine learning architecture suffers from the curse of dimensionality when combining multiple high-dimensional modalities. Due to the complexity and heterogeneity of this severe disease, there is currently not enough research on multi-modal breast cancer clinical outcome analysis, despite the encouraging results of the approaches discussed previously for other cancer prognoses. Meanwhile, the ever-increasing number of variables from diverse data sources and the heterogeneity of their traits make it very challenging to combine them effectively for predicting breast cancer survival. Despite several multi-modal studies for breast cancer, no study includes all six modalities (clinical, copy number variation, DNA-methylation, microRNA sequencing, gene expression, and histopathological whole slide images) for survival classification. The current study investigates the impact of multiple modalities in machine learning-based breast cancer survival classifiers. However, the raw features from all these modalities are very high-dimensional and contain redundant and noisy attributes. The high-dimensional feature space hampers the learning of the machine learning classifiers. Hence, we need to reduce the feature space through feature selection or extraction techniques. The reduced feature space boosts the learning of the model in terms of its generalization ability and computational complexity. So, the first task of our model is to extract the pertinent features from all the modalities using VAEs or PCA, and the second task is to fuse the extracted features for the final classification.
In this work, we propose the following:
1. Investigation of six different modalities (clinical, copy number variation, DNA-methylation, miRSeq, mRNASeq, and histopathological whole slide images) in the task of breast cancer survival prediction.
2. Log-cosh VAEs as the dimensionality reduction technique for all the modalities, where the L\(_{1}\) or L\(_{2}\) loss of the VAE is replaced with the log-cosh loss function.
3. Six different PCAs as dimensionality reduction techniques for all the modalities.
4. Raw features, PCA-projected features, or log-cosh VAE-extracted features as input to various machine learning classifiers (SVMs, Random Forest) for the survival estimation task.
Dataset and methods
Dataset
The study is focused on breast cancer prognosis prediction based on multiple modalities, and the most suitable dataset for this study is The Cancer Genome Atlas Program for Breast Cancer (TCGA-BRCA). We downloaded this data from https://xenabrowser.net/datapages/. The TCGA-BRCA dataset has a total of six modalities for each breast cancer patient. It includes clinical (\({{\textbf {m}}}_{{cln}}\)) details of 1236 patients, copy number variations (\({{\textbf {m}}}_{{cnv}}\)) of 1079 patients, DNA-Methylation (\({{\textbf {m}}}_{{dna}}\)) with common attributes between Illumina Human Methylation 450 and Illumina Human Methylation 27 of 1227 patients, miRSeq (\({{\textbf {m}}}_{{mir}}\)) with IlluminaGA and IlluminaHiseq of 1188 patients, mRNASeq (\({{\textbf {m}}}_{{mrna}}\)) of 1215 patients, and Histopathological Whole Slide Images (\({{\textbf {m}}}_{{wsi}}\)) of 1068 patients. Not all patients are present across all modalities in the raw dataset, resulting in incomplete multi-view data. In a multi-modal learning framework, incomplete multi-view scenarios make the learning of any machine learning model troublesome26. Hence, we selected the complete multi-view data for every possible combination of modalities using an intersection over common patient IDs. The final selected patients for all possible combinations are presented in Table 1.
For the \({{\textbf {m}}}_{{cln}}\) modality, we select a total of 21 categorical features consisting of demographic details (age, gender, race, ethnicity) and primary cancer diagnostic details (history of radiation therapy and breast cancer surgery, menopause status, prior and synchronous malignancy, tumor stage, etc.). The age attribute of \({{\textbf {m}}}_{{cln}}\) is normalized to the range [0,1] using min-max normalization, and the other attributes are used as categorical integers. For the \({{\textbf {m}}}_{{cnv}}\) modality, we select 24,776 GISTIC2-thresholded genes with estimated values − 2, − 1, 0, 1, and 2, representing homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. For the \({{\textbf {m}}}_{{dna}}\) modality, we select 23,381 common DNA methylation features from the Illumina Infinium HumanMethylation27 and HumanMethylation450 platforms. For the \({{\textbf {m}}}_{{mir}}\) modality, we select 1883 miRBase accession numbers common between the IlluminaGA and IlluminaHiseq miRNA mature strand expression RNAseq platforms. For the \({{\textbf {m}}}_{{mrna}}\) modality, we select 20,530 genes representing the gene expression values for each patient. The selected genes are further discretized16 to the values − 1, 0, and 1, representing under-expressed, baseline, and over-expressed genes. We observed certain missing values in the raw data and handled them as follows: if a particular feature had more than 10% missing values, we discarded it; otherwise, the weighted nearest neighbor approach27 was employed to estimate the missing values. Finally, the shapes of the raw data are \(1236 \times 21\), \(1079 \times 24,776\), \(1227 \times 23,381\), \(1188 \times 1883\), and \(1215 \times 20,530\) for the \({{\textbf {m}}}_{{cln}}\), \({{\textbf {m}}}_{{cnv}}\), \({{\textbf {m}}}_{{dna}}\), \({{\textbf {m}}}_{{mir}}\), and \({{\textbf {m}}}_{{mrna}}\) modalities, respectively. To handle the curse of dimensionality arising from the high-dimensional \({{\textbf {m}}}_{{cnv}}\) and \({{\textbf {m}}}_{{mrna}}\) modalities, or from their combinations (bi, tri, quad, penta, hexa) with other modalities, we select the top 500 attributes as the best representative subset for the given sample space, capturing more than 98% of the variance across all cancer patients present in the respective modalities. We discarded attributes whose values are identical for all patients, because such attributes are not helpful to the machine learning classifiers and can even become bottlenecks for our classification task.
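As an illustration of this preprocessing policy, the following is a minimal sketch assuming pandas and scikit-learn; the file name, the use of a distance-weighted KNNImputer as a stand-in for the weighted nearest neighbor approach27, and the neighbor count are assumptions for demonstration, not the paper's exact implementation.

```python
# Hypothetical preprocessing sketch for one genomic modality (file name is illustrative).
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.feature_selection import VarianceThreshold

cnv = pd.read_csv("tcga_brca_cnv.tsv", sep="\t", index_col=0)  # patients x genes

# Drop features with more than 10% missing values.
cnv = cnv.loc[:, cnv.isna().mean() <= 0.10]

# Impute remaining gaps with a distance-weighted nearest-neighbour estimate
# (an assumed stand-in for the weighted nearest neighbor approach of ref. 27).
cnv = pd.DataFrame(
    KNNImputer(n_neighbors=5, weights="distance").fit_transform(cnv),
    index=cnv.index, columns=cnv.columns,
)

# Discard constant features, then keep the 500 highest-variance attributes.
cnv = cnv.loc[:, VarianceThreshold(0.0).fit(cnv).get_support()]
cnv = cnv[cnv.var().nlargest(500).index]
```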
To represent and encode the \({{\textbf {m}}}_{{wsi}}\) modality, we needed deep learning methods that can effectively extract hidden informative features from WSIs. However, the high resolution of WSIs makes it difficult to learn from them directly, so an element of stochastic sampling and filtering is involved. In this work, we use a relatively simple approach to sample WSIs. We sample 256 \(\times\) 256 pixel patches at the highest resolution using "PyHIST: A Histological Image Segmentation Tool"28. Then, we select the top 20% of the generated patches (or 40 patches) with the highest RGB density as ROIs; this ensures that 'non-representative' patches belonging to white space are ignored and that the densest tiles, which include more cells, are kept for further investigation24. These 40 ROIs represent, on average, 15% of the tissue region within the WSI. Each of these 40 patches is passed through a deeper yet less complex CNN, ResNet15229, pre-trained on the ImageNet dataset30, to obtain hidden informative features of 2048 dimensions from the last hidden layer.
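A hedged sketch of the ranking and embedding steps is given below. The directory layout, the darkness-based density score, and the TF 2.x Keras API are assumptions (the reported experiments use keras 2.2.4/TensorFlow 1.14); PyHIST itself is assumed to have already written the 256 \(\times\) 256 patches to disk.

```python
# Sketch: rank PyHIST-generated patches by tissue density and embed the top 40
# with an ImageNet-pretrained ResNet152. Paths and the density heuristic are
# illustrative assumptions.
import glob
import numpy as np
from PIL import Image
from tensorflow.keras.applications import ResNet152
from tensorflow.keras.applications.resnet import preprocess_input

def rgb_density(path):
    # Assumption: darker (more tissue-dense) patches score higher than white background.
    arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    return 255.0 - arr.mean()

patches = sorted(glob.glob("patches/TCGA-XX/*.png"),
                 key=rgb_density, reverse=True)[:40]

# Global average pooling yields one 2048-d vector per patch.
backbone = ResNet152(weights="imagenet", include_top=False, pooling="avg")
batch = preprocess_input(np.stack([
    np.asarray(Image.open(p).convert("RGB").resize((224, 224)), dtype=np.float32)
    for p in patches]))
features = backbone.predict(batch)   # shape: (40, 2048)
```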
With the help of the survival distribution details extracted from the patients' clinical profiles and a 5-year survival cut-off, our problem is defined as a binary classification task distinguishing long-term survivors (labeled 0) from short-term survivors (labeled 1).
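A deliberately simplified labeling sketch follows; the column names are assumptions (not the exact TCGA field names), and the subtleties of censored follow-up are ignored here for brevity.

```python
# Hedged sketch: binary survival labels from clinical follow-up.
import pandas as pd

cln = pd.read_csv("tcga_brca_clinical.tsv", sep="\t", index_col=0)  # assumed file
FIVE_YEARS_DAYS = 5 * 365

# 1 = short-term survivor (died within 5 years), 0 = long-term survivor.
died_early = (cln["vital_status"] == "Dead") & (cln["days_to_death"] <= FIVE_YEARS_DAYS)
labels = died_early.astype(int)
```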
Methods
The study of the multi-modal survival classification task for breast cancer patients is performed with machine learning classifiers such as the RBF SVM (\({rbf}\_{{svm}}\)), linear SVM (\({linear}\_{{svm}}\)), polynomial SVM (\({polynomial}\_{{svm}}\)), sigmoid SVM (\({sigmoid}\_{{svm}}\)), and Random Forest (RF). The low sample size and high class imbalance of the TCGA-BRCA dataset cause deep neural networks to lose their generalization ability, resulting in overfitting. Hence, we limit our study to machine learning classifiers only.
We use three sets of features in the proposed work for our classification. The first set is the selected raw features contributing 98% of the variance across the instances. It is evident from the dataset section that each uni-modality (except \({{\textbf {m}}}_{{cln}}\)) and its multi-modal combinations have a high-dimensional feature space and a low sample size. A dataset whose feature space is larger than its number of samples usually suffers from the curse of dimensionality problem31 during machine learning training. In this situation, it is important to generate latent features with reduced dimensions without losing the relevant information. Apart from the raw features, the second feature set is the PCA-projected features with reduced dimensions, and the last feature set is extracted from the log-cosh VAE. The final prediction model can be visualized with the flow chart given in Fig. 1.
PCA as dimensionality reduction
PCA is a traditional yet effective dimensionality reduction technique for high-dimensional datasets, able to project a large number of features into a small feature space while preserving as much information as possible, and it has been widely used in bioinformatics studies17,32. It derives a sequence of best linear combinations of the original attributes and reveals a new, reduced set of variables, the principal components, while ensuring that little information from the original set is excluded from the analysis. The principal components are ordered by the variance they explain in the original attributes33. In PCA-based feature reduction, we preserve 95% of the variance among the final selected features for all the uni-modalities. The selected PCA features of \({{\textbf {m}}}_{{cnv}}\) are derived from the linear combinations of cnv raw features responsible for the highest variance in homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification. The selected PCA features of \({{\textbf {m}}}_{{dna}}\) are derived from the linear combinations of dna raw features responsible for the highest variance in the ratio of the methylated bead-type intensity to the combined locus intensity, representing hypermethylation and hypomethylation of the DNA. The selected PCA features of \({{\textbf {m}}}_{{mrna}}\) are derived from the linear combinations of mrna raw features responsible for the highest variance in the under-expressed, baseline, and over-expressed genes. Similarly, \({{\textbf {m}}}_{{cln}}\), \({{\textbf {m}}}_{{mir}}\), and \({{\textbf {m}}}_{{wsi}}\) also have linear combinations of their raw features as final PCA features.
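In scikit-learn, a variance-preservation target maps directly onto the PCA API; a minimal sketch (the random placeholder matrix stands in for the selected mRNASeq features):

```python
# Minimal sketch: PCA retaining enough components to explain 95% of variance.
# scikit-learn treats a float n_components in (0, 1) as a variance target.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mrna_raw = rng.standard_normal((1215, 500))  # placeholder for patients x 500 genes

pca_mrna = PCA(n_components=0.95, svd_solver="full")
mrna_pca = pca_mrna.fit_transform(mrna_raw)
print(mrna_pca.shape, pca_mrna.explained_variance_ratio_.sum())
```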
For \({{\textbf {m}}}_{{wsi}}\), we projected the hidden feature embeddings of the pre-trained ResNet152 into a feature space of 512 dimensions with the help of PCA. We then concatenated the embeddings of all 40 patches to generate WSI features of size 20,480 for each patient. In the WSI case as well, the machine learning models cannot be trained effectively because of the small number of samples relative to the large feature space. So, we further reduced the features to 800 dimensions with the help of PCA while preserving 95% of the variance. Finally, we have a dataset of size 1068 \(\times\) 800 for the \({{\textbf {m}}}_{{wsi}}\) modality.
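The two-stage reduction can be sketched as follows (the random placeholder tensor stands in for the stored ResNet152 embeddings; component counts mirror the description above):

```python
# Sketch of the two-stage WSI reduction: 2048-d patch embeddings -> 512-d per
# patch via PCA, concatenate 40 patches -> 20,480-d, then reduce to 800-d.
import numpy as np
from sklearn.decomposition import PCA

n_patients, n_patches = 1068, 40
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((n_patients, n_patches, 2048)).astype(np.float32)

stage1 = PCA(n_components=512)
per_patch = stage1.fit_transform(embeddings.reshape(-1, 2048))   # (1068*40, 512)
concat = per_patch.reshape(n_patients, n_patches * 512)          # (1068, 20480)

stage2 = PCA(n_components=800)                                   # ~95% variance target
wsi_features = stage2.fit_transform(concat)                      # (1068, 800)
```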
Log-cosh VAE as dimensionality reduction
In the VAE-based feature reduction technique, we design variational autoencoders (VAEs) with log-cosh loss, \(VAE_{cln}\), \(VAE_{cnv}\), \(VAE_{dna}\), \(VAE_{mir}\), \(VAE_{mrna}\), and \(VAE_{wsi}\), for the corresponding modalities \({{\textbf {m}}}_{{cln}}\), \({{\textbf {m}}}_{{cnv}}\), \({{\textbf {m}}}_{{dna}}\), \({{\textbf {m}}}_{{mir}}\), \({{\textbf {m}}}_{{mrna}}\), and \({{\textbf {m}}}_{{wsi}}\), respectively. A simple autoencoder neural network is powerful enough to project any N-dimensional data to L latent dimensions with the highest degree of freedom. However, this high degree of freedom causes the autoencoder to overfit while minimizing the reconstruction loss, generating overly specialized (less regularized) latent features. To ensure the regularity of the latent space and an enhanced decoder for regenerating the raw data, the VAE introduces explicit regularization during the training process. Here, we encode the uni-modal raw features as a normal distribution over the latent space. By optimizing the regularization term, the Kullback–Leibler divergence between the returned distribution and a standard normal distribution, the encoder is forced to return a distribution resembling a standard normal distribution through the learned mean and variance. A random point is sampled from the latent space for reconstruction using the decoder, and the reconstruction loss is computed. This can be visualized in Fig. 2, and the mathematical foundation of \({VAE}_{cln}\) is presented below.
Let us say the encoder (q) of \({VAE}_{cln}\) encodes the raw cln features \(x_{cln}\) to the latent space \(z_{cln}\sim q_{\phi }\left( z_{cln}|x_{cln}\right)\), while the decoder (p) reconstructs the raw features as \({\hat{x}}_{cln}\sim p_{\theta }\left( x_{cln}|z_{cln}\right)\) from the latent space \(z_{cln}\). Here, \(\phi\) and \(\theta\) are the encoder and decoder parameters, respectively. \({VAE}_{cln}\) is designed with two objectives: first, to match the decoded data point \({\hat{x}}_{cln}\) with the real clinical features \(x_{cln}\); and second, to match the posterior distribution \(q_{\phi }\left( z_{cln}|x_{cln}\right)\) to a given prior distribution (the standard normal distribution) \(p\left( z_{cln}\right)\). So, the VAE is trained with the objective of minimizing the following loss function:
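Given the two objectives above, Eq. (1) is the standard negative evidence lower bound; it is written here in a reconstructed form consistent with the surrounding description:

\[ \mathcal {L}\left( \theta ,\phi \right) \;=\; -\,{\mathbb {E}}_{z_{cln}\sim q_{\phi }\left( z_{cln}|x_{cln}\right) }\left[ \log p_{\theta }\left( x_{cln}|z_{cln}\right) \right] \;+\; D_{KL}\left( q_{\phi }\left( z_{cln}|x_{cln}\right) \,\Vert \, p\left( z_{cln}\right) \right) \qquad (1) \]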
The first term in Eq. (1) represents the reconstruction loss, which becomes \(L_{2}\) loss if the decoder predicts the Gaussian distribution \(p_{\theta }\left( x_{cln}|z_{cln}\right) \propto \exp (-{\parallel x_{cln}-{\hat{x}}_{cln} \parallel }_{2}^2)\), and \(L_{1}\) loss if decoder predicts the zero-mean Laplacian distribution \(p_{\theta }\left( x_{cln}|z_{cln}\right) \propto \exp (-{\parallel x_{cln}-{\hat{x}}_{cln} \parallel }_{1})\). The second term tries to match the distribution of latent space to the prior distribution \(p(z_{cln})\), which defaults to the standard normal distribution.
If the squared \(L_{2}\) loss is incorporated as the reconstruction loss, then Eq. (1) is dominated by the second term (the KL divergence term), as \(L_{2}\) penalizes small reconstruction errors too lightly and reconstruction accuracy deteriorates. If the \(L_{1}\) loss is incorporated, larger reconstruction errors are penalized too heavily, which harms the optimization of the latent space. Another major issue with the \(L_{1}\) loss is that it is non-differentiable at 0, causing its gradient to oscillate between \(\pm 1\).
To overcome the limitations of the \(L_{1}\) and squared \(L_{2}\) reconstruction losses, we use log-cosh as the reconstruction loss of the proposed VAE, which combines the benefits of both. It also satisfies the basic motivation of the VAE: a balance between reconstruction accuracy and regular latent space generation. The log-cosh loss is defined as follows:
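One standard parameterization of the log-cosh reconstruction loss, written here in a form consistent with the hyper-parameter \(a\) described next (a reconstruction, not a verbatim copy of the original equation):

\[ L_{log\text{-}cosh}\left( x_{cln},{\hat{x}}_{cln}\right) \;=\; \frac{1}{a}\sum _{i}\log \left( \cosh \left( a\left( {\hat{x}}_{cln,i}-x_{cln,i}\right) \right) \right) \]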
Here, \(a\in \mathcal {R^+}\) is a hyper-parameter. Finally, the proposed VAE is trained on the regularization term (Kullback–Leibler divergence) and the reconstruction term (log-cosh). \(VAE_{cln}\), \(VAE_{cnv}\), \(VAE_{dna}\), \(VAE_{mir}\), \(VAE_{mrna}\), and \(VAE_{wsi}\) are each optimized with the motive of minimizing the analogous objective for their respective modality, combining that modality's log-cosh reconstruction term with its KL regularization term.
The latent feature space of \(VAE_{cln}\) has 4 dimensions; for all other VAEs, it is fixed at 32. We use the encoded latent features from each VAE and concatenate them to form the multi-modal feature set for the final classification goal. A minimal sketch of one such VAE is given below.
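The sketch is a hedged illustration rather than the paper's exact architecture: the hidden layer sizes, the scale hyper-parameter A, and the TF 2.x Keras API are assumptions (the experiments reported later use keras 2.2.4 with TensorFlow 1.14).

```python
# Hedged sketch of a log-cosh VAE for one genomic modality.
import tensorflow as tf
from tensorflow.keras import layers, Model

INPUT_DIM, LATENT_DIM, A = 500, 32, 10.0  # A: assumed log-cosh scale hyper-parameter

def sampling(args):
    """Reparameterization trick: z = mu + sigma * epsilon."""
    z_mean, z_log_var = args
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: raw features -> (mean, log variance) of the latent Gaussian.
inputs = layers.Input(shape=(INPUT_DIM,))
h = layers.Dense(128, activation="relu")(inputs)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = layers.Lambda(sampling)([z_mean, z_log_var])

# Decoder: sampled latent point -> reconstruction of the raw features.
h_dec = layers.Dense(128, activation="relu")(z)
outputs = layers.Dense(INPUT_DIM)(h_dec)

vae = Model(inputs, outputs)

# Reconstruction term: log-cosh; regularization term: KL divergence.
log_cosh = tf.reduce_mean(
    tf.reduce_sum(tf.math.log(tf.cosh(A * (outputs - inputs))) / A, axis=-1))
kl = tf.reduce_mean(-0.5 * tf.reduce_sum(
    1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
vae.add_loss(log_cosh + kl)
vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=100, batch_size=32)  # x_train: patients x INPUT_DIM

encoder = Model(inputs, z_mean)  # latent features fed to the downstream classifiers
```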
Results and analysis
In this section, we perform a comparative analysis of SVMs based on some popular kernels and the random forest classifier on raw, PCA-projected, and VAE-extracted features. The evaluation of the proposed framework is carried out using conventional performance measures: accuracy (Acc), precision (Pre), sensitivity (Sn), and \(f_{1}\)-score. To calculate these measures, the four outcomes are defined as follows: a patient who survived fewer than 5 years and is predicted as a short-term survivor is a true positive (tp); a patient who survived more than 5 years and is predicted as a long-term survivor is a true negative (tn); a patient who survived more than 5 years but is predicted as a short-term survivor is a false positive (fp); and a patient who survived fewer than 5 years but is predicted as a long-term survivor is a false negative (fn).
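Under this mapping, the four measures reduce to the usual confusion-matrix formulas; a small self-contained sketch (the counts in the example call are made up for illustration):

```python
# Performance measures with short-term survivor as the positive class.
def survival_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    sn = tp / (tp + fn)               # sensitivity (recall)
    f1 = 2 * pre * sn / (pre + sn)
    return acc, pre, sn, f1

print(survival_metrics(tp=60, tn=20, fp=10, fn=10))  # illustrative counts only
```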
Experimental setup
All the experimental analyses in this study are carried out on a system running Ubuntu 18.04.5 LTS with an 11 GB NVIDIA GeForce RTX 2080 Ti GPU and 32 GB of RAM. The coding setup consists of keras-gpu 2.2.4 with tensorflow-gpu 1.14.0 as the backend and Python 3.6, bundled in an Anaconda environment.
In the task of multi-modal breast cancer survival classification, we performed an ablation study over all possible combinations of modalities, ranging from uni-modal to hexa-modal. The learning is performed with machine learning classifiers for
Uni-modal: [ \({{\textbf {m}}}_{{cln}}\), \({{\textbf {m}}}_{{cnv}}\), \({{\textbf {m}}}_{{dna}}\), \({{\textbf {m}}}_{{mir}}\), \({{\textbf {m}}}_{{mrna}}\), \({{\textbf {m}}}_{{wsi}}\) ]; Bi-modal: [ \({{\textbf {m}}}_{{cln\_cnv}}\), \({{\textbf {m}}}_{{cln\_dna}}\), \({{\textbf {m}}}_{{cln\_mir}}\), \({{\textbf {m}}}_{{cln\_mrna}}\), \({{\textbf {m}}}_{{cln\_wsi}}\), \({{\textbf {m}}}_{{cnv\_dna}}\), \({{\textbf {m}}}_{{cnv\_mir}}\), \({{\textbf {m}}}_{{cnv\_mrna}}\), \({{\textbf {m}}}_{{cnv\_wsi}}\), \({{\textbf {m}}}_{{dna\_mir}}\), \({{\textbf {m}}}_{{dna\_mrna}}\), \({{\textbf {m}}}_{{dna\_wsi}}\), \({{\textbf {m}}}_{{mir\_mrna}}\), \({{\textbf {m}}}_{{mir\_wsi}}\), \({{\textbf {m}}}_{{mrna\_wsi}}\) ]; Tri-modal: [ \({{\textbf {m}}}_{{cln\_cnv\_dna}}\), \({{\textbf {m}}}_{{cln\_cnv\_mir}}\), \({{\textbf {m}}}_{{cln\_cnv\_mrna}}\), \({{\textbf {m}}}_{{cln\_cnv\_wsi}}\), \({{\textbf {m}}}_{{cln\_dna\_mir}}\), \({{\textbf {m}}}_{{cln\_dna\_mrna}}\), \({{\textbf {m}}}_{{cln\_dna\_wsi}}\), \({{\textbf {m}}}_{{cln\_mir\_mrna}}\), \({{\textbf {m}}}_{{cln\_mir\_wsi}}\), \({{\textbf {m}}}_{{cln\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cnv\_dna\_mir}}\), \({{\textbf {m}}}_{{cnv\_dna\_mrna}}\), \({{\textbf {m}}}_{{cnv\_dna\_wsi}}\), \({{\textbf {m}}}_{{cnv\_mir\_mrna}}\), \({{\textbf {m}}}_{{cnv\_mir\_wsi}}\), \({{\textbf {m}}}_{{cnv\_mrna\_wsi}}\), \({{\textbf {m}}}_{{dna\_mir\_mrna}}\), \({{\textbf {m}}}_{{dna\_mir\_wsi}}\), \({{\textbf {m}}}_{{dna\_mrna\_wsi}}\), \({{\textbf {m}}}_{{mir\_mrna\_wsi}}\) ]; Quad-modal: [ \({{\textbf {m}}}_{{cln\_cnv\_dna\_mir}}\), \({{\textbf {m}}}_{{cln\_cnv\_dna\_mrna}}\), \({{\textbf {m}}}_{{cln\_cnv\_dna\_wsi}}\), \({{\textbf {m}}}_{{cln\_cnv\_mir\_mrna}}\), \({{\textbf {m}}}_{{cln\_cnv\_mir\_wsi}}\), \({{\textbf {m}}}_{{cln\_cnv\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cln\_dna\_mir\_mrna}}\), \({{\textbf {m}}}_{{cln\_dna\_mir\_wsi}}\), \({{\textbf {m}}}_{{cln\_dna\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cln\_mir\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cnv\_dna\_mir\_mrna}}\), \({{\textbf {m}}}_{{cnv\_dna\_mir\_wsi}}\), \({{\textbf {m}}}_{{cnv\_dna\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cnv\_mir\_mrna\_wsi}}\), \({{\textbf {m}}}_{{dna\_mir\_mrna\_wsi}}\) ]; Penta-modal: [ \({{\textbf {m}}}_{{cln\_cnv\_dna\_mir\_mrna}}\), \({{\textbf {m}}}_{{cln\_cnv\_dna\_mir\_wsi}}\), \({{\textbf {m}}}_{{cln\_cnv\_dna\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cln\_cnv\_mir\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cln\_dna\_mir\_mrna\_wsi}}\), \({{\textbf {m}}}_{{cnv\_dna\_mir\_mrna\_wsi}}\) ]; and Hexa- or Multi-modal: [ \({{\textbf {m}}}_{{cln\_cnv\_dna\_mir\_mrna\_wsi}}\) ] combinations of modalities. Machine learning classifiers are trained in a ten-fold stratified cross-validation set-up. In each fold, the training instances of the minority class are upsampled to mitigate the impact of the high class imbalance. In the process of moving from uni-modal to multi-modal machine learning classifiers, the increased feature space of the integrated input data raises concerns about model overfitting. To avoid this problem, our SVMs are trained as soft-margin SVMs with a small regularization parameter. We also incorporate another hyperparameter, gamma, which is inversely proportional to the number of input features and controls the influence of individual data points on the decision boundary: a larger input feature space implies a smaller value of gamma, resulting in a more generic decision boundary influenced by more data points.
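A hedged sketch of this evaluation protocol follows; the regularization parameter C and scikit-learn's gamma="scale" heuristic (which shrinks with the number of input features) are assumptions mirroring the description, not the paper's exact values.

```python
# Stratified 10-fold CV with per-fold minority upsampling and a soft-margin SVM.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

def cross_validate(X, y, C=0.1):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Upsample the minority class within the training fold only,
        # so the test fold keeps its natural class distribution.
        minority = int(np.bincount(y_tr).argmin())
        X_min, y_min = X_tr[y_tr == minority], y_tr[y_tr == minority]
        n_extra = int(np.bincount(y_tr).max()) - len(y_min)
        X_up, y_up = resample(X_min, y_min, n_samples=n_extra, random_state=42)
        X_tr = np.vstack([X_tr, X_up])
        y_tr = np.concatenate([y_tr, y_up])
        # Small C = softer margin; gamma="scale" is inversely tied to n_features.
        clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X_tr, y_tr)
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# acc = cross_validate(features, labels)  # features: ndarray, labels: 0/1 ints
```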
The following subsections provide a comparative analysis of the various combinations of modalities for breast cancer survival prediction.
Uni-modal
As we have six different uni-modalities (\({m_{cln}}\), \({m_{cnv}}\), \({m_{dna}}\), \({m_{mir}}\), \({m_{mrna}}\), and \({m_{wsi}}\)) in this study, we compare their performance using various measures, as depicted in Table 2. The polynomial_svm is the best classifier among the SVM kernels and the random forest for uni-modal data. The average \({Acc, f_{1}-score, and\ Sn}\) of this classifier are 0.706, 0.822, and 0.895, respectively, obtained with the VAE-extracted features.
As the study is focused on predicting patient survival and on treatment planning based on these predictions, we take the \({f_{1}-score}\) as the dominating evaluation measure for comparisons. Comparing the performances of the machine learning classifiers on raw, PCA, and VAE-extracted features at each modality level:
- For the cln modality, the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based random forest classifiers by 2.7% and 1.0%, respectively.
- For the cnv modality, the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based polynomial_svm by 5.8% and 8.1%, respectively.
- For the dna modality, the VAE feature-based polynomial_svm outperforms the raw feature-based RF by 4.6% and the PCA feature-based polynomial_svm by 1.5%.
- For the mir modality, the VAE feature-based polynomial_svm outperforms the raw feature-based polynomial_svm by 4.8%, while the PCA feature-based polynomial_svm performs similarly.
- For the mrna modality, the VAE feature-based polynomial_svm is best in terms of \({f_{1}-score}\) (0.824); it outperforms the raw feature-based polynomial_svm by 2.0% and performs similarly to the PCA feature-based polynomial_svm.
- For the wsi modality, the raw, PCA, and VAE feature-based rbf_svms perform similarly, with an \({f_{1}-score}\) of 0.840.
Bi-modal
Here, we perform a comparative study of all fifteen \((^6C_2)\) bi-modal combinations of features generated from the above six uni-modalities. Among all possible bi-modal combinations, the rbf_svm for the mir_wsi combination of raw features, the RF for the dna_wsi combination of PCA features, and the RF for the cnv_mrna combination of VAE features are the best classifiers, with \({Acc, f_{1}-score, Sn, and\ Pre}\) values of 0.771, 0.871, 1.00, and 0.771; 0.774, 0.872, 1.00, and 0.774; and 0.776, 0.872, 1.00, and 0.774, respectively. Further, if we compare the average over all bi-modal combinations for raw, PCA, and VAE features, the VAE-extracted feature-based SVM and RF classifiers outperform the other two. The average \({Acc, f_{1}-score, Sn, and\ Pre}\) values of the VAE-based RF classifier are 0.759, 0.862, 0.986, and 0.766, respectively, while the PCA-extracted feature-based RF attains 0.755, 0.858, 0.968, and 0.771, and the raw feature-based RF attains 0.738, 0.847, 0.947, and 0.766 for these metrics, respectively. Similarly, the VAE feature-based rbf_svm outperforms the raw and PCA feature-based rbf_svm by 5.6% and 4.0%, and the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based polynomial_svm by 9.6% and 14.1%, in terms of average \(f_{1}-score\). The average results for all bi-modal performance measures are depicted in Table 2.
Tri-modal
Here, we perform a comparative study of all twenty \((^6C_3)\) tri-modal combinations generated from the above six uni-modalities. Among all possible tri-modal combinations, the rbf_svm for the cln_cnv_wsi, cln_mir_wsi, cnv_dna_wsi, cnv_mrna_wsi, and mir_mrna_wsi combinations of raw features, the RF for the dna_mrna_wsi combination of PCA features, and the polynomial_svm for the cln_mrna_wsi and dna_mrna_wsi combinations of VAE features are the best classifiers. The attained \({Acc, f_{1}-score, Sn, and\ Pre}\) values for the raw feature-based rbf_svm are 0.771, 0.871, 1.00, and 0.771; for the PCA feature-based RF and the VAE feature-based polynomial_svm, they are 0.774, 0.872, 1.00, and 0.774, respectively. Further, we compare the average over all tri-modal combinations for raw, PCA-extracted, and VAE-extracted features; the VAE-extracted feature-based RF classifier outperforms the other two. The average \({Acc, f_{1}-score, Sn, and\ Pre}\) values attained by the VAE-extracted feature-based RF classifier are 0.763, 0.865, 0.991, and 0.767, respectively, while the raw and PCA-extracted feature-based RF attain 0.739, 0.848, 0.949, and 0.766, and 0.755, 0.857, 0.964, and 0.772 for these measures, respectively. The dominating measure, \({f_{1}-score}\), for the VAE feature-based rbf_svm shows 3.6% and 2.2% improvements over the raw and PCA feature-based rbf_svm, while the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based polynomial_svm by 7.2% and 8.9%, respectively. The average results for all tri-modal performance measures are depicted in Table 2.
Quad-modal
Here, we perform a comparative study of all fifteen \((^6C_4)\) quad-modal combinations generated from the above six uni-modalities. Among all possible quad-modal combinations, the rbf_svm for the cln_cnv_dna_wsi, cln_cnv_mrna_wsi, cln_mir_mrna_wsi, and cnv_dna_mrna_wsi combinations of raw features and for the cln_cnv_dna_wsi, cln_mir_mrna_wsi, cln_cnv_mrna_wsi, and cln_cnv_dna_wsi combinations of PCA features, and the RF for the cln_dna_mrna_wsi combination of VAE features, are the best classifiers. The attained \({Acc, f_{1}-score, Sn, and\ Pre}\) values for the raw and PCA feature-based rbf_svm are 0.771, 0.871, 1.00, and 0.771; for the VAE feature-based RF, they are 0.774, 0.872, 1.00, and 0.774, respectively. Further, if we compare the average over all quad-modal combinations for raw, PCA-extracted, and VAE-extracted features, the VAE-extracted feature-based RF classifier outperforms the other two, with average \({Acc, f_{1}-score, Sn, and\ Pre}\) values of 0.764, 0.866, 0.995, and 0.767, respectively, while the raw and PCA-extracted feature-based RF classifiers attain 0.738, 0.847, 0.949, and 0.765, and 0.753, 0.856, 0.957, and 0.775 in terms of these performance measures, respectively. The dominating measure, \({f_{1}-score}\), for the VAE feature-based rbf_svm shows 1.7% and 0.9% improvements over the raw and PCA feature-based rbf_svm, while the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based polynomial_svm by 5.1% and 4.1%, respectively. The average results for all quad-modal performance measures are depicted in Table 2.
Penta-modal
Here, we perform a comparative study of all six \((^6C_5)\) penta-modal combinations generated from the above six uni-modalities. Among all possible penta-modal combinations, the rbf_svm for the cln_cnv_dna_mrna_wsi combination of raw features and of PCA features, and the polynomial_svm for the cln_cnv_dna_mrna_wsi combination of VAE features, are the best classifiers, with \({Acc, f_{1}-score, Sn, and\ Pre}\) values of 0.771, 0.871, 1.00, and 0.771, respectively. Further, if we compare the average over all penta-modal combinations for the raw, PCA-extracted, and VAE-extracted feature sets, the VAE-extracted feature-based RF classifier outperforms the other two. The average \({Acc, f_{1}-score, Sn, and\ Pre}\) values of the VAE-extracted feature-based RF classifier are 0.762, 0.864, 0.991, and 0.766, respectively, while the raw and PCA-extracted feature-based RF classifiers attain values of 0.742, 0.850, 0.954, and 0.767, and 0.752, 0.855, 0.952, and 0.777, respectively. The dominating measure, \({f_{1}-score}\), for the VAE feature-based rbf_svm shows comparable performance with the raw and PCA feature-based rbf_svm, while the VAE feature-based polynomial_svm outperforms the raw and PCA feature-based polynomial_svm by 3.4% and 3.9%, respectively. The average results for all penta-modal performance measures are depicted in Table 2.
Hexa-modal or multi-modal
Here, we perform a comparative study combining all six uni-modalities. The raw and PCA feature-based rbf_svm and the VAE-extracted feature-based polynomial_svm are the best among the various multi-modal machine learning classifiers, each achieving \({Acc, f_{1}-score, Sn, and\ Pre}\) values of 0.769, 0.870, 1.00, and 0.769, respectively. If we compare the multi-modal \(f_{1}-score\) of the best VAE feature-based classifier with those of the best uni-modal VAE feature-based classifiers, the multi-modal polynomial_svm leads the \({m_{cln}}\) polynomial_svm by 9.8%, the \({m_{cnv}}\) polynomial_svm by 2.8%, the \({m_{dna}}\) polynomial_svm by 4.5%, the \({m_{mir}}\) polynomial_svm by 4.2%, the \({m_{mrna}}\) polynomial_svm by 4.6%, and the \({m_{wsi}}\) polynomial_svm by 3.0%. The gains of the multi-modal over the uni-modal measures establish the significance of all six modalities in the breast cancer survival prediction task. The detailed results can be seen in Table 2.
Identifying the best feature set (raw, PCA, or VAE) and the impact of additional modalities
A thorough empirical and ablation study on the six different modalities suggests that the VAE-extracted features are the best features for the breast cancer survival prediction goal. Hence, to study the impact and usefulness of multiple modalities, we adopt the VAE-extracted features. The RF's average performance measures (\({Acc, f_{1}-score, Sn, and\ Pre}\)) for the hexa-modal (multi-modal) combination are 0.769, 0.870, 1.00, and 0.769, which are 0.7%, 0.6%, 0.9%, and 0.3% higher than the penta-modal; 0.5%, 0.4%, 0.5%, and 0.2% higher than the quad-modal; 0.6%, 0.5%, 0.8%, and 0.2% higher than the tri-modal; 1.1%, 0.8%, 1.4%, and 0.3% higher than the bi-modal; and 10.8%, 8.7%, 19.7%, and 0.6% higher than the uni-modal average performance measures, respectively. Here, we observe the trend that adding extra modalities gradually decreases the performance gap between the classifiers. Further, we also investigate the standard deviation of the performance measures for all combinations of modalities and observe that the standard deviations of the \({Acc, f_{1}-score, Sn, and\ Pre}\) metrics are 2.94%, 2.55%, 5.45%, and 0.68% for uni-modal; 0.80%, 0.49%, 0.91%, and 0.56% for bi-modal; 0.47%, 0.32%, 0.24%, and 0.38% for tri-modal; 0.29%, 0.21%, 0.00%, and 0.29% for quad-modal; and 0.23%, 0.17%, 0.00%, and 0.23% for penta-modal, respectively. The decline in standard deviation values as extra modalities are added supports the fact that multi-modal classifiers are more stable and robust.
Discussion
In this study, we present log-cosh VAEs to obtain engineered features as compact representations of various modalities, including imaging, genetic, and clinical data. We then trained and tested SVMs and random forests over the various multi-modal combinations for the breast cancer prognosis prediction task. We also validated the superiority of the VAE features over the PCA and raw features. In our experiments, we observed the trend that the performance measures of the different machine learning classifiers improve as more modalities are added as input. The empirical results in Table 2 suggest that the polynomial_svm is the best classifier for the uni-, tri-, quad-, penta-, and hexa-modal combinations, while the random forest is superior for the bi-modal combinations. The findings of our study establish that all six sources are significant for prognosis, and oncologists should consider all of them while planning the treatment of any breast cancer patient. We also validated the stability and robustness of the classifiers across the various multi-modal combinations: the gradual decline in the standard deviations of the performance measures from uni- to hexa-modal classifiers supports the claim that classifiers with more modalities are more stable. Although the VAE features better support the classification goal, they remain limited in interpretability, as they may not have a clear biological meaning.
Conclusion
Breast cancer and its heterogeneous information coming from complex sources make it difficult for oncologists to gauge the severity of the disease and plan the correct treatment path for a patient. To the best of our knowledge, this is the first study to use six different modalities for breast cancer survival estimation. We propose several ways to reduce the complexity of these heterogeneous data: feature selection preserving 98% of the variance across the instances, followed by PCA for dimensionality reduction and the log-cosh VAE as the feature extraction technique. We establish that the VAE-extracted features have very low dimensionality (4 for clinical and 32 for the other modalities) in contrast with the raw and PCA feature spaces, while still achieving better results on the breast cancer survival estimation task. Hence, log-cosh VAEs are reliable feature extraction techniques. We conclude this study with the claim that adding extra modalities has not only improved the performance of the machine learning classifiers but also enhanced their stability and robustness over different combinations of modalities. This proves the positive significance of all modalities for breast cancer survival prediction, and oncologists should consider all six modalities in their decision-making. Despite the significant performance of the proposed multimodal architectures for breast cancer prognosis on the preprocessed TCGA-BRCA dataset, the models still need to be tested on primary data prospectively to understand their behavior and importance during real-time clinical trials of breast cancer patients.
Data availability
The dataset used in this study is available at (https://xenabrowser.net/datapages/) and source code can be accessed at (https://github.com/nikhilaryan92/logcoshVAE_brca_surv).
References
Altman, D. G. Prognostic models: A methodological framework and review of models for breast cancer. Cancer Investig. 27, 235–243. https://doi.org/10.1080/07357900802572110 (2009) (PMID: 19291527).
Stone, P. & Lund, S. Predicting prognosis in patients with advanced cancer. Ann. Oncol. 18, 971–976. https://doi.org/10.1093/annonc/mdl343 (2007).
Martin, L. R., Williams, S. L., Haskard, K. B. & Dimatteo, M. R. The challenge of patient adherence. Ther. Clin. Risk Manag. 1, 189–199 (2005).
Delen, D., Walker, G. & Kadam, A. Predicting breast cancer survivability: A comparison of three data mining methods. Artif. Intell. Med. 34, 113–127. https://doi.org/10.1016/j.artmed.2004.07.002 (2005).
Sun, D., Wang, M. & Li, A. A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data. IEEE/ACM Trans. Comput. Biol. Bioinf. 16, 841–850. https://doi.org/10.1109/TCBB.2018.2806438 (2019).
Arya, N. & Saha, S. Multi-modal classification for human breast cancer prognosis prediction: Proposal of deep-learning based stacked ensemble model. IEEE ACM Trans. Comput. Biol. Bioinform.https://doi.org/10.1109/TCBB.2020.3018467 (2020).
Arya, N. & Saha, S. Multi-modal advanced deep learning architectures for breast cancer survival prediction. Knowl.-Based Syst. 221, 106965. https://doi.org/10.1016/j.knosys.2021.106965 (2021).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352. https://doi.org/10.1038/nature10983 (2012).
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. (Poznan, Poland) 19, A68-77. https://doi.org/10.5114/wo.2014.47136 (2015).
Obermeyer, Z. & Emanuel, E. J. Predicting the future—Big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219. https://doi.org/10.1056/NEJMp1606181 (2016).
van’t Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536. https://doi.org/10.1038/415530a (2002).
van de Vijver, M. J. et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347, 1999–2009. https://doi.org/10.1056/NEJMoa021967 (2002).
Xu, X., Zhang, Y., Zou, L., Wang, M. & Li, A. A gene signature for breast cancer prognosis using support vector machine. In 2012 5th International Conference on BioMedical Engineering and Informatics 928–931. https://doi.org/10.1109/BMEI.2012.6513032 (2012).
Nguyen, C., Wang, Y. & Nguyen, H. N. Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. J. Biomed. Sci. Eng. 06, 551–560. https://doi.org/10.4236/jbise.2013.65070 (2013).
Sun, Y., Goodison, S., Li, J., Liu, L. & Farmerie, W. Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics (Oxford, England) 23, 30–37. https://doi.org/10.1093/bioinformatics/btl543 (2007).
Gevaert, O., De Smet, F., Timmerman, D., Moreau, Y. & De Moor, B. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics (Oxford, England) 22, e184-190. https://doi.org/10.1093/bioinformatics/btl230 (2006).
Khademi, M. & Nedialkov, N. S. Probabilistic graphical models and deep belief networks for prognosis of breast cancer. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) 727–732. https://doi.org/10.1109/ICMLA.2015.196 (2015).
Das, J., Gayvert, K. M., Bunea, F., Wegkamp, M. H. & Yu, H. ENCAPP: Elastic-net-based prognosis prediction and biomarker discovery for human cancers. BMC Genom. 16, 263. https://doi.org/10.1186/s12864-015-1465-9 (2015).
Sun, D., Li, A., Tang, B. & Wang, M. Integrating genomic data and pathological images to effectively predict breast cancer clinical outcome. Comput. Methods Programs Biomed. 161, 45–53. https://doi.org/10.1016/j.cmpb.2018.04.008 (2018).
Moon, W. K. et al. Computer-aided prediction of axillary lymph node status in breast cancer using tumor surrounding tissue features in ultrasound images. Comput. Methods Programs Biomed. 146, 143–150. https://doi.org/10.1016/j.cmpb.2017.06.001 (2017).
Kwak, J. T. & Hewitt, S. M. Multiview boosting digital pathology analysis of prostate cancer. Comput. Methods Programs Biomed. 142, 91–99. https://doi.org/10.1016/j.cmpb.2017.02.023 (2017).
Wang, H., Xing, F., Su, H., Stromberg, A. & Yang, L. Novel image markers for non-small cell lung cancer classification and survival prediction. BMC Bioinform. 15, 310. https://doi.org/10.1186/1471-2105-15-310 (2014).
Zhu, X. et al. Lung cancer survival prediction from pathological images and genetic data - An integration study. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), 1173–1176, https://doi.org/10.1109/ISBI.2016.7493475 (2016). ISSN: 1945-8452.
Yu, K.-H. et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat. Commun. 7, 12474. https://doi.org/10.1038/ncomms12474 (2016).
Tang, B., Li, A., Li, B. & Wang, M. CapSurv: Capsule network for survival analysis with whole slide pathological images. IEEE Access 7, 26022–26030. https://doi.org/10.1109/ACCESS.2019.2901049 (2019).
Arya, N. & Saha, S. Generative incomplete multi-view prognosis predictor for breast cancer: GIMPP. IEEE/ACM Trans. Comput. Biol. Bioinf. https://doi.org/10.1109/TCBB.2021.3090458 (2022).
Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics (Oxford, England) 17, 520–525. https://doi.org/10.1093/bioinformatics/17.6.520 (2001).
Muñoz-Aguirre, M., Ntasis, V. F., Rojas, S. & Guigó, R. PyHIST: A histological image segmentation tool. PLoS Comput. Biol. 16, e1008349. https://doi.org/10.1371/journal.pcbi.1008349 (2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778. https://doi.org/10.1109/CVPR.2016.90 (IEEE, 2016).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y (2015).
Aliper, A. et al. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13, 2524–2530. https://doi.org/10.1021/acs.molpharmaceut.6b00248 (2016).
Das, U., Srizon, A. Y., Al Mehedi Hasan, M., Rahman, J. & Ben Islam, M. K. Effective data dimensionality reduction workflow for high-dimensional gene expression datasets. In 2020 IEEE Region 10 Symposium (TENSYMP) 182–185. https://doi.org/10.1109/TENSYMP50017.2020.9230847 (IEEE, 2020).
Jolliffe, I. T. Principal Component Analysis. Springer Series in Statistics (Springer, 1986).
Acknowledgements
Sriparna Saha and Snehanshu Saha would like to thank the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia) and the DBT-Builder project, BITS Pilani K K Birla Goa Campus (No. BT/INF/22/SP42543/2021) for supporting the work, respectively.
Author information
Authors and Affiliations
Contributions
N.A., S.S. and S.S. conceived the idea, N.A. and A.M. conducted the experiment(s), N.A., S.S., and S.S. analyzed the results. N.A. wrote the manuscript with valuable input from S.S. and S.S. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Arya, N., Saha, S., Mathur, A. et al. Improving the robustness and stability of a machine learning model for breast cancer prognosis through the use of multi-modal classifiers. Sci Rep 13, 4079 (2023). https://doi.org/10.1038/s41598-023-30143-8