Table 1 A summary of prior literature on hybrid CNN-Transformer models, their contributions, and drawbacks.

From: Enhancing artistic style classification through a novel ArtFusionNet framework

| Study | Model | Limitations | Key Features | Proposed Innovation |
| --- | --- | --- | --- | --- |
| Zhang et al.17 | CNN-Transformer with Attention | Complex architecture; lower performance on abstract art | Attention mechanisms to refine feature extraction | ArtFusionNet enhances multi-scale feature extraction and fusion across all artistic styles. |
| Liu et al.18 | Multi-scale CNN-Transformer | Increased computational load; limited performance across diverse datasets | Multi-level feature extraction through pyramid pooling | ArtFusionNet employs pyramid pooling and dilated convolutions for scalable, efficient feature extraction. |
| Huo et al.20 | Dual-band CNN-Transformer | High computational overhead; limited scalability | Simultaneous local and global feature extraction | ArtFusionNet integrates local and global features efficiently, with minimal overhead, via adaptive fusion. |
| Zhang et al.21 | Dynamic Weighting CNN-Transformer | Struggles with balanced fusion; high computational cost | Dynamic fusion of CNN and Transformer features | ArtFusionNet balances feature fusion with an adaptive weighting mechanism. |
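The adaptive weighting idea recurring in the last column can be sketched as a learned convex combination of the two feature streams. The sketch below is illustrative only: the function and parameter names (`adaptive_fusion`, `w_gate`) are assumptions, not ArtFusionNet's actual API, and the gate here is a simple logistic weight rather than whatever parameterization the paper uses.

```python
import numpy as np

def adaptive_fusion(cnn_feat, transformer_feat, w_gate):
    """Fuse local (CNN) and global (Transformer) feature vectors with a
    learned scalar gate. Hypothetical sketch, not the paper's exact method."""
    # Gate: logistic weight computed from the concatenated features.
    z = np.concatenate([cnn_feat, transformer_feat]) @ w_gate
    alpha = 1.0 / (1.0 + np.exp(-z))  # sigmoid -> weight in (0, 1)
    # Convex combination balances the two streams adaptively per input.
    return alpha * cnn_feat + (1.0 - alpha) * transformer_feat

rng = np.random.default_rng(0)
d = 8
cnn_feat = rng.normal(size=d)          # stand-in local features
transformer_feat = rng.normal(size=d)  # stand-in global features
w_gate = rng.normal(size=2 * d)        # stand-in gate parameters
fused = adaptive_fusion(cnn_feat, transformer_feat, w_gate)
print(fused.shape)  # (8,)
```

Because the gate output lies in (0, 1), each fused component stays between the corresponding CNN and Transformer values, which is one way to keep the fusion "balanced" as the table describes.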