Abstract
Breast cancer diagnosis from histopathology images remains challenging due to two intertwined factors: severe class imbalance, where malignant cases represent a small minority of samples, and the need to integrate discriminative features across multiple spatial scales. Existing methods typically address imbalance and multi-scale fusion separately, leading to biased or redundant representations. We propose CMAF-Net, a theoretically grounded architecture that unifies information bottleneck principles with margin-based learning to jointly tackle these challenges. CMAF-Net employs a dual-branch CNN-Transformer backbone fused through a Cross-Modal Attention Fusion block, which implements temperature-controlled attention and redundancy minimization to preserve complementary local and global features. At the classification level, we introduce an Adaptive Class-Balanced Focal Loss that operationalizes margin theory under imbalance, enforcing larger margins for minority classes while dynamically adapting to feature distributions. Extensive experiments on the IDC dataset show that CMAF-Net achieves 94.92% sensitivity and 95.52% balanced accuracy, outperforming state-of-the-art baselines by up to 8.6% on malignant detection. Under extreme 99:1 imbalance, CMAF-Net maintains 76.45% sensitivity, demonstrating graceful degradation where competing methods fail catastrophically. Cross-dataset evaluation on BreakHis confirms robust zero-shot transfer across four magnifications with average sensitivity of 95.61%. Ablation studies and information-theoretic analyses validate the contributions of each component, while computational profiling shows CMAF-Net achieves superior accuracy-efficiency trade-offs compared to prior fusion networks. Beyond breast cancer, our framework establishes a principled template for information-theoretic fusion under class imbalance, with implications for rare disease detection, clinical decision support, and broader multi-modal learning tasks.
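The loss described above combines three published ingredients: effective-number class weights (Cui et al., 2019), focal down-weighting of easy examples (Lin et al., 2017), and label-distribution-aware margins that grow as class frequency shrinks (Cao et al., 2019). The snippet below is a minimal NumPy sketch of such a loss under those assumptions — the function name and the hyperparameters `beta`, `gamma`, and `margin_scale` are illustrative, and the exact CMAF-Net formulation (including its adaptive schedule) may differ.

```python
import numpy as np

def adaptive_cb_focal_loss(logits, targets, class_counts,
                           beta=0.9999, gamma=2.0, margin_scale=0.5):
    """Sketch of a class-balanced focal loss with per-class margins.

    Illustrative only: combines effective-number class weights,
    focal modulation, and LDAM-style margins proportional to
    n_c^(-1/4); not the authors' exact implementation.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    counts = np.asarray(class_counts, dtype=float)

    # Effective-number re-weighting: w_c ∝ (1 - beta) / (1 - beta^n_c),
    # normalized so the class weights average to 1.
    weights = (1.0 - beta) / (1.0 - beta ** counts)
    weights = weights / weights.sum() * len(counts)

    # Rarer classes get a larger margin subtracted from their true-class
    # logit, enforcing a wider decision boundary for minority classes.
    margins = margin_scale / counts ** 0.25
    rows = np.arange(len(targets))
    adjusted = logits.copy()
    adjusted[rows, targets] -= margins[targets]

    # Numerically stable log-softmax, then the focal modulation.
    z = adjusted - adjusted.max(axis=1, keepdims=True)
    log_pt = (z - np.log(np.exp(z).sum(axis=1, keepdims=True)))[rows, targets]
    pt = np.exp(log_pt)
    focal = (1.0 - pt) ** gamma * (-log_pt)
    return float((weights[targets] * focal).mean())
```

Under a 99:1 imbalance, the minority class in this sketch receives both a larger weight and a larger margin; raising `margin_scale` strictly increases the loss on correctly classified samples, which is the margin-enforcement effect the abstract refers to.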
Data availability
The IDC and BreakHis datasets used in this study are publicly available. The complete source code, trained weights, and experiment scripts will be released publicly on GitHub upon acceptance: https://github.com/wizzydredd/CMAF-Net
References
Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73(1), 17–48. https://doi.org/10.3322/caac.21763 (2023).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25(8), 1301–1309 (2019).
Madabhushi, A. & Lee, G. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Anal. 33, 170–175 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (2021).
Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: Marrying convolution and attention for all data sizes. Adv. Neural Inform. Processing Syst. 34, 3965–3977 (2021).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. J. Artificial Intell. Res. 16, 321–357 (2002).
Cui, Y., Jia, M., Lin, T.-Y., Song, Y. & Belongie, S. Class-balanced loss based on effective number of samples. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9268–9277 (2019).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, 2980–2988 (2017).
Zhang, Z., Xu, M., Zhang, W. & Li, Q. Information fusion for multi-scale data: Survey and challenges. Information Fusion 89, 391–417 (2023).
Tishby, N., Pereira, F. C. & Bialek, W. The information bottleneck method. arXiv preprint physics/0004057 (2000).
Cao, K., Wei, C., Gaidon, A., Arechiga, N. & Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32 (2019).
Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019).
Zhao, Y. et al. A comprehensive survey on deep learning based data fusion methods in smart healthcare systems. Information Fusion 108, 102361 (2024).
Gao, J., Li, P., Chen, Z. & Zhang, J. A survey on deep learning for multimodal data fusion. Neural Computation 32(5), 829–864 (2020).
Ramachandram, D. & Taylor, G. W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017).
Shamshad, F., Khan, S., Zamir, S. W., Khan, M. H., Hayat, M., Khan, F. S. & Fu, H. Transformers in medical imaging: A survey. Medical Image Analysis, 102802 (2023).
Zhou, S. K. et al. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE 109(5), 820–838 (2021).
Huang, Y. et al. What makes multi-modal learning better than single (provably). Adv. Neural Inform. Processing Syst. 34, 10944–10956 (2021).
Zhang, Y., Liu, H. & Hu, Q. Transfuse: Fusing transformers and CNNs for medical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 14–24 (2021). Springer.
Chen, C.-F. R., Fan, Q. & Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. Proceedings of the IEEE/CVF International Conference on Computer Vision, 357–366 (2021).
Liu, J. et al. Multi-level feature fusion network combining attention mechanisms for polyp segmentation. Information Fusion 104, 102195 (2024).
Cai, Z. et al. Dafnet: A novel dynamic adaptive fusion network for medical image classification. Information Fusion 126, 103507 (2026).
Nagrani, A., Yang, S., Arnab, A., Schmid, C. & Sun, C. Attention bottlenecks for multimodal fusion. Adv. Neural Inform. Processing Syst. 34, 14200–14213 (2021).
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A. & Carreira, J. Perceiver: General perception with iterative attention. International Conference on Machine Learning, 4651–4664 (2021). PMLR.
Zhou, T., Fu, H., Zhang, Y., Zhang, C., Lu, X., Shen, J. & Shao, L. Multimodal learning in clinical imaging: A comprehensive survey. Medical Image Analysis, 102859 (2023).
Wang, H. et al. Tinyvit-lightgbm: A lightweight and smart feature fusion framework for iomt-based cancer diagnosis. Information Fusion 125, 105253 (2025).
Liu, J., Zhang, Y., Chen, J.-N., Xiao, J., Lu, Y., Landman, B. A., Yuan, Y., Yuille, A., Tang, Y. & Zhou, Z. Clip-driven universal model for organ segmentation and tumor detection. Proceedings of the IEEE/CVF International Conference on Computer Vision, 21152–21164 (2023).
Johnson, J. M. & Khoshgoftaar, T. M. Survey on deep learning with class imbalance. J. Big Data 6(1), 1–54 (2019).
Mullick, S. S., Datta, S. & Das, S. Generative adversarial minority oversampling. Proceedings of the IEEE/CVF International Conference on Computer Vision, 1695–1704 (2019).
Zhang, H., Xu, H., Tian, X., Jiang, J. & Ma, J. Deep learning-based methods for medical image fusion: A review. Comput. Biol. Med. 136, 104664 (2021).
Chlap, P. et al. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiation Oncology 65(5), 545–563 (2021).
Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A. & Kumar, S. Long-tail learning via logit adjustment. International Conference on Learning Representations (2021).
Li, X., Sun, X., Meng, Y., Liang, J., Wu, F. & Li, J. Dice loss for data-imbalanced NLP tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 465–476 (2020).
Kini, G. R., Paraskevas, O., Oymak, S. & Thrampoulidis, C. Label-imbalanced and group-sensitive classification under overparameterization. Adv. Neural Inform. Processing Syst. 34, 18970–18983 (2021).
Menon, A., Rawat, A. S., Reddi, S. & Kumar, S. Statistical consistency and convergence of label noise learning under class-conditional noise models. J. Mach. Learn. Res. 22(159), 1–53 (2021).
Collell, G., Prelec, D. & Patil, K. R. Unbiased loss functions for imbalanced classification. Pattern Recognition 131, 108881 (2022).
Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017).
Alemi, A. A., Fischer, I., Dillon, J. V. & Murphy, K. Deep variational information bottleneck. International Conference on Learning Representations (2017).
Saxe, A. M. et al. On the information bottleneck theory of deep learning. J. Statistical Mech.: Theory and Experiment 2019(12), 124020 (2019).
Goldfeld, Z. & Polyanskiy, Y. The information bottleneck problem and its applications in machine learning. IEEE J. Selected Areas Inform. Theory 1(1), 19–38 (2020).
Geiger, B. C. & Kubin, G. Information-theoretic perspective on generalization and memorization in machine learning. IEEE Transactions on Information Theory (2021).
Federici, M., Dutta, A., Forré, P., Kushman, N. & Akata, Z. Learning robust representations via multi-view information bottleneck. International Conference on Learning Representations (2020).
Wang, S. et al. Multi-view information bottleneck for medical image analysis. Medical Image Anal. 85, 102765 (2023).
Pluim, J. P., Maintz, J. A. & Viergever, M. A. Mutual-information-based registration of medical images: A survey. IEEE Trans. Med. Imaging 22(8), 986–1004 (2003).
Guo, Y., Wu, J., Li, L. & Gao, X. Mutual information-based multimodal image registration: A review. Neurocomputing 492, 644–663 (2022).
Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.-W. & Heng, P.-A. Information fusion for multi-modality medical image segmentation: A survey. Artificial Intelligence in Medicine, 102547 (2023).
Elton, D.C. Self-explaining neural networks: A review. arXiv preprint arXiv:2105.05837 (2021)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. & Jégou, H. Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning, 10347–10357 (2021). PMLR.
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L. & Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H. & Douze, M. Levit: A vision transformer in convnet’s clothing for faster inference. Proceedings of the IEEE/CVF International Conference on Computer Vision, 12259–12269 (2021).
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L. & Zhang, L. Cvt: Introducing convolutions to vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31 (2021).
Mehta, S. & Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. International Conference on Learning Representations (2022).
Srinidhi, C. L., Ciga, O. & Martel, A. L. Deep neural network models for computational histopathology: A survey. Med. Image Anal. 67, 101813 (2021).
Dimitriou, N., Arandjelović, O. & Caie, P. D. Deep learning for whole slide image analysis: an overview. Front. Med. 6, 264 (2019).
Spanhol, F. A., Oliveira, L. S., Petitjean, C. & Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 63(7), 1455–1462 (2016).
Yan, R. et al. Breast cancer histopathological image classification using a hybrid deep neural network. Methods 173, 52–60 (2020).
Tellez, D. et al. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med. Image Anal. 58, 101544 (2019).
Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. International Conference on Machine Learning, 2127–2136 (2018). PMLR.
Li, B., Li, Y. & Eliceiri, K. W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328 (2021).
Brown, G., Wyatt, J., Harris, R. & Yao, X. Diversity creation methods: a survey and categorisation. Information fusion 6(1), 5–20 (2005).
Cover, T. M. & Thomas, J. A. Elements of Information Theory 2nd edn. (John Wiley & Sons, Hoboken, New Jersey, 2006).
Macenko, M., Niethammer, M., Marron, J. S., Borland, D., Woosley, J. T., Guan, X., Schmitt, C. & Thomas, N. E. A method for normalizing histology slides for quantitative analysis. 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 1107–1110 (2009). IEEE.
Wang, X., Girshick, R., Gupta, A. & He, K. Attention mechanisms in computer vision: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Xiong, Y. et al. Nyströmformer: A nyström-based algorithm for approximating self-attention. Proceed. AAAI Conf. Artificial Intell. 35(16), 14138–14148 (2021).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations (2019).
Foret, P., Kleiner, A., Mobahi, H. & Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. International Conference on Learning Representations (2021).
Dao, T., Fu, D., Ermon, S., Rudra, A. & Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inform. Process. Syst. 35, 16344–16359 (2022).
Cruz-Roa, A., Basavanhally, A., González, F., Gilmore, H., Feldman, M., Ganesan, S., Shih, N., Tomaszewski, J. & Madabhushi, A. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. Medical Imaging 2014: Digital Pathology 9041, 904103 (2014).
Yang, Y., Zha, S., Wang, J. & Zhang, Z. A survey on long-tailed visual recognition. Int. J. Computer Vision 130(7), 1837–1872 (2022).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017).
Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 6105–6114 (2019). PMLR.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T. & Xie, S. A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11976–11986 (2022).
Ding, J., Xue, N., Xia, G.-S., Dai, D. & Yang, M.Y. Hrfnet: High-resolution feature network for dense prediction. arXiv preprint arXiv:2108.07697 (2021)
Joze, H.R.V., Shaban, A., Iuzzolino, M.L. & Koishida, K. Mmtm: Multimodal transfer module for cnn fusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13289–13299 (2020)
Ren, J. et al. Balanced meta-softmax for long-tailed visual recognition. Adv. Neural Inform. Process. Syst. 33, 4175–4186 (2020).
Acknowledgements
The authors gratefully acknowledge the financial support that made this research possible; the funding sources are detailed in the Funding section.
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. U22B2061), the Institute of Information & Communications Technology Planning & Evaluation (IITP) - Information Technology Research Center (ITRC) grant funded by the Ministry of Science and ICT, Republic of Korea (Grant No. IITP-2025-RS-2024-00437191), and by the Deanship of Scientific Research, King Khalid University, Saudi Arabia (Grant No. RGP2/314/45).
Author information
Authors and Affiliations
Contributions
W.X.A. conceived and designed the study, developed the methodology, curated the data, performed the formal analysis, and wrote the original draft of the manuscript. W.X.A., W.C., L.K., W.A., F.S., M.A.A.-a., Y.H.G., and A.A. contributed to writing, reviewing, and editing the manuscript. W.X.A., W.A., and A.A. curated the data, while W.X.A. and L.K. carried out the formal analyses. L.K. and F.S. contributed to visualization, and F.S. conducted the experiments. W.C., W.A., and Y.H.G. contributed to validation, and W.C. provided supervision. W.C., M.A.A.-a., and Y.H.G. acquired funding, and A.A. managed the project. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ativi, W.X., Chen, W., Kwao, L. et al. CMAF-Net: cross-modal attention fusion with information-theoretic regularization for imbalanced breast cancer histopathology. Sci Rep (2026). https://doi.org/10.1038/s41598-025-32794-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-32794-1