Introduction

Colonoscopy remains the gold standard for colorectal cancer screening, yet its effectiveness is fundamentally limited by operator-dependent variability in polyp detection and characterization1,2. These limitations have motivated the development of artificial intelligence systems to assist clinicians during colonoscopy: computer-aided detection (CADe) systems that help identify polyps in real time, and computer-aided diagnosis (CADx) systems that suggest the likely histological type based on visual appearance. Such tools aim to reduce missed lesions, improve diagnostic accuracy, and help standardize the quality of colonoscopy across different practice settings3.

Deep learning approaches have transformed colorectal cancer detection and diagnosis across multiple clinical applications. Convolutional neural networks (CNN) pretrained on large image datasets and fine-tuned on medical images have demonstrated robust performance not only for polyp detection and classification during colonoscopy, but also for histopathological subtyping, prognostic prediction from tissue samples, and treatment response assessment. Recent architectures including YOLO variants, enhanced U-Net models, and transformer-based approaches have achieved detection sensitivities exceeding 90% and real-time processing capabilities suitable for clinical deployment4,5,6,7,8,9 (Table 1). However, the CNN paradigm imposes significant development constraints. Each new model requires extensive labeled training data specific to the target population and imaging equipment, iterative optimization of network architectures and hyperparameters, and validation across multiple institutions to ensure generalizability. These requirements make CNN development resource-intensive and create barriers to rapid adaptation as clinical needs or imaging technology evolve.

Table 1 Overview of studies assessing the performance of deep learning models in medical imaging.

Advances in vision-language models (VLM) suggest an alternative approach that addresses data and development barriers. Contrastive Language-Image Pre-training (CLIP) demonstrated that joint training of visual and language encoders on large-scale image-text pairs enables zero-shot task performance through natural language prompts alone10. Unlike CNNs that require fine-tuning on labeled medical images to adapt pretrained features to specific tasks, CLIP-based models can be deployed directly through prompt specification. BiomedCLIP extended this framework to the biomedical domain through pretraining on 15 million figure-caption pairs from PubMed Central11, improving medical imaging performance while maintaining zero-shot deployment. More recently, large VLMs such as GPT-412, Claude-3-Opus13, and Gemini-1.5-Pro14 have integrated sophisticated visual encoders with transformer-based language models, enabling complex reasoning about medical images without any task-specific fine-tuning (Table 2)15,16,17,18,19,20,21. These models represent a fundamentally different deployment paradigm: rather than adapting model weights to each clinical task, the same pretrained model is applied across diverse applications through natural language instructions.

Table 2 Overview of studies assessing the performance of vision-language models in medical imaging.

The potential advantages of zero-shot VLM for CADe and CADx are substantial. Eliminating the fine-tuning step removes the need for institution-specific labeled datasets and model optimization. Prompt-based interaction allows flexible task specification without retraining. Pretraining on billions of diverse images may confer robustness to the distribution shifts that degrade fine-tuned CNN performance across institutions. However, these theoretical advantages remain unvalidated for colonoscopy applications. Whether zero-shot VLMs can match the detection sensitivity of fine-tuned CNNs, how they perform across histological classification tasks of varying difficulty, whether they generalize better to external datasets, and how sensitive they are to prompt design are empirical questions with direct implications for clinical deployment strategies.

We systematically evaluated 15 computational approaches spanning classical machine learning (CML), CNN, contrastive vision-language encoders, and state-of-the-art VLMs for frame-level polyp detection and histological classification. Using 2,258 colonoscopy images with pathological confirmation and external validation on 75 images from an independent institution, we compared zero-shot VLM performance against fine-tuned CNNs and classical methods across both binary detection and multi-class classification tasks. We further investigated prompt engineering strategies, few-shot learning, computational requirements, and model interpretability to assess practical deployment considerations. Our results establish performance benchmarks across model families, reveal task-dependent capabilities and limitations, and provide evidence-based guidance for selecting appropriate approaches based on clinical requirements and available resources.

Methods

Ethical consideration

This study received ethical approval from the Institutional Review Board at the Research Ethics Committees of the Research Institute for Gastroenterology & Liver Diseases at Shahid Beheshti University of Medical Sciences (IR.SBMU.RIGLD.REC.1401.043). In accordance with the principles outlined in the Helsinki Declaration, patient confidentiality and welfare were maintained throughout the study. All procedures involving patient data and images were conducted using standardized protocols to safeguard patient privacy, with measures in place to anonymize data and prevent identification. Explicit informed consent was obtained from all participants, affirming their voluntary participation in the study.

The external dataset used in this study was anonymized and obtained under a signed data agreement. The dataset provider had secured prior ethical approval for its collection and use and is registered in the National Registry of Biobanks (B.0000140) and the ISCIII Biomodels and Biobanks Platform (PT23/00013).

Experimental framework

This investigation followed a retrospective, comparative methodological design to evaluate multiple artificial intelligence approaches for colonoscopy image analysis. We adhered to the Consolidated reporting guidelines for prognostic and diagnostic machine learning modeling studies22 and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD-AI)23 guidelines for model development and results reporting, ensuring methodological transparency and reproducibility. We structured our investigation as a three-phase experimental program designed to systematically evaluate model performance:

  1. Parameter Optimization Phase: We systematically identified optimal hyperparameters for each model architecture through comprehensive grid search methodologies, establishing optimized configurations for subsequent performance evaluation.

  2. Detection Evaluation Phase: We conducted comparative assessments of model performance in identifying polyp presence (CADe functionality), utilizing standardized metrics, including F1 scores and area under the receiver operating characteristic curve (AUROC).

  3. Classification Analysis Phase: We performed systematic evaluation of model efficacy in correctly classifying polyp pathology types (CADx functionality) across six distinct histological categories, employing weighted evaluation metrics to account for class distribution.

This structured approach enabled comprehensive, controlled comparison across diverse computational methodologies while maintaining consistent evaluation standards.

Dataset characteristics

Patient population and data collection

We examined colonoscopy data collected between December 2022 and April 2023 at Taleghani Hospital’s gastroenterology clinic and Behbood clinic. The study population comprised 428 patients (mean age: 53 ± 14 years; 48.6% male) who underwent colonoscopy for primary colorectal cancer screening, post-polypectomy surveillance, evaluation following positive fecal immunochemical tests, or investigation of gastrointestinal symptoms.

All procedures were performed by gastroenterologists with extensive experience (> 2,000 screening colonoscopies conducted). The endoscopists assessed bowel preparation quality using the validated Boston Bowel Preparation Scale and confirmed cecal intubation through identification of the ileocecal valve and appendix orifice.

Image collection and histopathological assessment

We compiled a comprehensive image dataset consisting of 1,129 colon polyp images and 1,129 randomly selected normal colon images (from an original pool of 6,046) to address class imbalance. The initial classification was derived from procedure pathology reports, followed by an expert review of stored images by an experienced gastroenterologist (PKM) who assigned final labels.

Tissue samples underwent standard histopathological processing, including formalin fixation, paraffin embedding, sectioning (4–5 microns thick), and hematoxylin-eosin staining. Histological classification followed established criteria, with assessment of cellular atypia, glandular architecture, and dysplasia degree24. Our final dataset comprised 2,258 images from 428 patients, including tubular adenoma (n = 771), hyperplastic polyp (n = 138), adenocarcinoma (n = 79), tubulovillous adenoma (n = 59), inflammatory polyp (n = 45), villous adenoma (n = 36), and normal colon (n = 1,129). Complete dataset characteristics are provided in Table 3.

Table 3 Characteristics of the dataset at both patient and image levels.

External dataset for validation

Sample images and anonymized patient data used in this study were obtained from the PICCOLO database of the Basque Biobank (www.biobancovasco.bioef.eus), which is registered in the National Registry of Biobanks (B.0000140) and integrated into the ISCIII Biomodels and Biobanks Platform (PT23/00013). This dataset contains 3,433 images from clinical colonoscopy videos, including white light and narrow band imaging (NBI) images, from colonoscopy procedures in human patients. It includes 76 different lesions from 48 patients. We selected a total of 75 images, comprising 9 adenocarcinomas, 50 adenomatous polyps, and 16 hyperplastic polyps from white light images.

Since the external dataset contains no normal images and only three distinct polyp classes, we adapted our internal dataset by selecting and organizing it in the same way, allowing for a consistent comparison between internal and external datasets.

Image preprocessing and data augmentation

We implemented a comprehensive preprocessing pipeline to optimize image quality and enhance model training. All images underwent uniform resizing to 300 × 300 pixels, followed by normalization to standardize pixel value distribution. To enhance model robustness and generalizability, we applied a systematic augmentation protocol incorporating horizontal and vertical mirroring to diversify polyp orientation representation, brightness variations to simulate diverse lighting conditions, Gaussian blur application to replicate optical aberrations, additive Gaussian noise to build resilience against image artifacts, and linear contrast adjustments to enhance structural differentiation. This augmentation strategy yielded a four-fold expansion of the effective training dataset, simultaneously enhancing model exposure to diverse image acquisition parameters and reducing overfitting to institution-specific imaging characteristics.
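
The following sketch illustrates one way to implement this preprocessing and augmentation pipeline. The use of imgaug, the parameter ranges, and the data handling are illustrative assumptions rather than the study's exact configuration; only the target size, normalization, transform types, and four-fold expansion come from the description above.

```python
# Sketch of the preprocessing/augmentation pipeline described above (assumptions noted in the lead-in).
import cv2
import numpy as np
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                   # horizontal mirroring
    iaa.Flipud(0.5),                                   # vertical mirroring
    iaa.Multiply((0.8, 1.2)),                          # brightness variation (assumed range)
    iaa.GaussianBlur(sigma=(0.0, 1.0)),                # simulated optical aberration
    iaa.AdditiveGaussianNoise(scale=(0, 0.02 * 255)),  # resilience to image artifacts
    iaa.LinearContrast((0.8, 1.2)),                    # linear contrast adjustment
])

def augment_fourfold(img_uint8: np.ndarray) -> list[np.ndarray]:
    """Return the original plus three augmented copies (four-fold expansion)."""
    return [img_uint8] + [augmenter(image=img_uint8) for _ in range(3)]

def preprocess(img_uint8: np.ndarray) -> np.ndarray:
    """Resize to 300x300 and normalize pixel values to [0, 1]."""
    resized = cv2.resize(img_uint8, (300, 300))
    return resized.astype(np.float32) / 255.0
```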

Model development and configuration

Classical machine learning approaches

We implemented five distinct classical machine learning algorithms, each optimized through systematic hyperparameter tuning (Table 4). For the Decision Tree Classifier, we employed a comprehensive grid search across multiple parameters, including criterion (‘gini’, ‘entropy’), max_depth (None, 10, 20, 30), min_samples_split (2, 5, 10), and min_samples_leaf (1, 2, 4). The optimal configuration identified was criterion=’entropy’, max_depth = 20, min_samples_leaf = 2, and min_samples_split = 2. For the Random Forest Classifier, our hyperparameter optimization encompassed n_estimators (50, 100, 200), max_depth (None, 10, 20, 30), min_samples_split (2, 5, 10), and min_samples_leaf (1, 2, 4). The optimal configuration was determined to be n_estimators = 200, min_samples_leaf = 1, min_samples_split = 10, and random_state = 42. With the Support Vector Machine (SVM), we systematically evaluated parameter combinations including C (0.1, 1, 10), kernel (‘linear’, ‘rbf’, ‘poly’), and gamma (‘scale’, ‘auto’). The optimal configuration identified was kernel=’rbf’, C = 10, gamma=’scale’, probability = True, and random_state = 42. For Logistic Regression, our grid search evaluated C values (0.1, 1, 10) and solver options (‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’). The optimal configuration was determined to be C = 0.1, solver=’sag’, and random_state = 42. The Gaussian Naive Bayes algorithm was implemented with default parameters as it does not feature adjustable hyperparameters.
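
As a concrete illustration of this tuning procedure, the SVM search could be run as sketched below. The parameter grid and the reported optimum follow the text; the flattened-pixel feature representation, file names, and five-fold cross-validation are assumptions.

```python
# Illustrative grid search for the SVM (analogous searches were run for the other classical models).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical arrays: flattened image features and binary polyp labels.
X_train = np.load("train_features.npy")
y_train = np.load("train_labels.npy")

param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}

search = GridSearchCV(
    SVC(probability=True, random_state=42),
    param_grid,
    scoring="f1",   # F1 is the study's primary metric
    cv=5,           # assumed fold count
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)  # reported optimum: kernel='rbf', C=10, gamma='scale'
```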

Table 4 Hyperparameter Tuning.

Convolutional neural network: ResNet50

We implemented ResNet50 based on its demonstrated superior performance in medical image classification tasks25. To optimize performance, we conducted systematic hyperparameter tuning via GridSearchCV, evaluating learning_rate (0.01, 0.1, 1), epochs (5, 10, 15), and batch_size (32, 64). The grid search involved dividing the dataset into training, validation, and testing subsets, training the model on various hyperparameter combinations, and using cross-validation to evaluate performance and prevent overfitting. The optimal configuration was determined to be learning_rate = 0.01, epochs = 15, and batch_size = 32.
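
A minimal transfer-learning setup consistent with this configuration is sketched below; the optimizer choice, classification head, and data handling are assumptions beyond the reported optimum (learning rate 0.01, 15 epochs, batch size 32).

```python
# Illustrative ResNet50 setup with TensorFlow/Keras (details beyond the reported hyperparameters are assumed).
import tensorflow as tf

def build_resnet50(num_classes: int, learning_rate: float = 0.01) -> tf.keras.Model:
    base = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_shape=(300, 300, 3)
    )
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_resnet50(num_classes=7)  # six polyp classes plus normal
# Optimal configuration from the grid search:
# model.fit(x_train, y_train, validation_split=0.15, epochs=15, batch_size=32)
```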

Contrastive multimodal encoders

We evaluated two specialized contrastive learning models for our analysis. CLIP represents a general-purpose multimodal model that associates images with corresponding textual descriptions through dual visual and textual encoders trained on 400 million image-text pairs10. We implemented the ViT-B/32 variant for zero-shot evaluation in our experimental framework. Additionally, we assessed BiomedCLIP, a domain-specialized adaptation of CLIP that underwent pretraining on PMC-15M, a dataset comprising 15 million biomedical figure-caption pairs from PubMed Central publications11. This biomedical specialization potentially enhances performance for medical imaging applications, making it particularly relevant for our colonoscopy image analysis.
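
For reference, zero-shot classification with the ViT-B/32 encoder can be sketched as follows. The candidate prompt texts are illustrative rather than the study's exact class descriptions, and BiomedCLIP can be queried analogously through the open_clip interface.

```python
# Zero-shot polyp detection sketch with CLIP ViT-B/32 (prompt texts are illustrative).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = [
    "an endoscopic image of normal colonic mucosa",
    "an endoscopic image of a colorectal polyp",
]
text_tokens = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("frame_001.png")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text_tokens)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(prompts, probs[0])))  # the class with the highest score is the prediction
```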

General-purpose vision-language models

We evaluated seven state-of-the-art VLMs as part of our comprehensive assessment. GPT-4, accessed through its vision-enabled variant, integrates advanced visual processing capabilities into OpenAI's GPT-4 language model, enabling interpretation of and response to image inputs26. In addition, we assessed the performance of more recent OpenAI models, namely GPT-4.1 and GPT-4.1-mini. We also included Claude-3-Opus, developed by Anthropic, which builds upon their Claude architecture with enhanced visual question answering capabilities13. The fifth and sixth models in our evaluation were Gemini-1.5-Pro, Google's multimodal foundation model designed for versatile tasks including visual comprehension, classification, and content generation across modalities14, and Gemma-3-27B. The last model was Qwen-2.5-VL-72B. These general-purpose models were evaluated without domain-specific fine-tuning to assess their zero-shot capabilities in medical image analysis.

We utilized the web-based API interface of GPT-4 (gpt-4-1106-vision-preview; Accessed: May 2024 via API), GPT-4.1 (Created Apr 14, 2025; Accessed: August 2025 via API), GPT-4.1-mini (Created Apr 14, 2025; Accessed: August 2025 via API), Claude-3-Opus (claude-3-opus-20240229; Accessed: May 2024 via API), Qwen-2.5-vl-72B (Created Feb 1, 2025; Accessed: August 2025 via API), Gemma-3-27B (Created Mar 12, 2025; Accessed: August 2025 via API), and Gemini-1.5-Pro (gemini-1.5-pro-001; Accessed: June 2024 via Google interface). Approximately 15% of our test dataset was allocated for the parameter optimization phase, while the remaining 85% was used for the detection and classification evaluation phase. All experiments were conducted with standardized parameters (temperature = 1.0, maximum tokens = 512, tool calls disabled, random seed = 123) to ensure consistent evaluation conditions. We renamed all image files to avoid any data leakage from the image metadata.
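
A sketch of a single request with these standardized parameters is shown below; the endpoint and payload follow the OpenAI chat-completions format and would differ for the other providers, and the helper name query_vlm is our own.

```python
# Hedged sketch of one VLM API call with the standardized parameters (temperature=1.0, max_tokens=512, seed=123).
import base64
import os
import requests

def query_vlm(image_path: str, prompt: str, model: str = "gpt-4.1") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": model,
        "temperature": 1.0,
        "max_tokens": 512,
        "seed": 123,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```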

To assess the impact of prompt optimization, we first used the following raw prompt in a chat: “What is this image?” accompanied by the image. Subsequently, in the same chat, we asked: “What is the pathology class of the polyp? Give me only one answer.” In a separate chat, we then posed this engineered prompt:

As an esteemed gastroenterologist specializing in colonoscopy evaluation, your expertise is crucial in meticulously assessing a provided colonoscopy image. Your task is to discern and characterize any irregularities present across the colonic mucosa, paying close attention to morphology, color variations, and vascularity patterns. Drawing upon your wealth of experience, construct a comprehensive list of potential diagnoses, including but not limited to inflammatory bowel disease, colorectal polyps, diverticulosis, and colorectal cancer. Your discerning analysis and diagnostic acumen will guide subsequent clinical decisions, emphasizing the importance of accurate interpretation and effective communication in delivering optimal patient care.

This was followed by the image. Then, in the same chat, we used the prompt:

Analyze the provided image and select one of the following options that accurately describes the patient’s diagnosis:

  1. normal.
  2. adenocarcinoma.
  3. adenomatous-tubular polyp.
  4. adenomatous-tubulovillous polyp.
  5. adenomatous-villous polyp.
  6. hyperplastic polyp.
  7. inflammatory polyp.

The optimized prompt was developed through a human-in-the-loop refinement process whereby candidate variations were generated using GPT-4, informed by prompt engineering techniques adapted from validated gastroenterology-specific methodologies27. These techniques included contextual embedding (providing task-specific domain context), expert mimicry (emulating clinical specialist reasoning patterns), chain-of-thought reasoning (eliciting stepwise analytical processes), exemplar anchoring (supplying representative clinical scenarios), and constrained output formatting (defining structured response schemas). A domain expert subsequently reviewed and refined the candidate to produce the final optimized prompt.

Exploring few-shot injection impact on general-purpose vision-language models

For few-shot learning, we selected representative images directly from the training dataset to serve as illustrative examples for the model. Specifically, we curated two sets of images with corresponding labels. The first set (one image for ‘no-polyp’ and one image for ‘polyp’) focused on distinguishing between polyp and non-polyp cases, providing general guidance on the presence or absence of polyps. The second set (one image for ‘normal’ and one image for each polyp subtype) concentrated on specific pathology classes. Each few-shot example consisted of an image paired with a descriptive label, and these were included in the prompt to facilitate accurate and informed predictions on unseen test images (see the sketch below). We applied few-shot learning to recently released, state-of-the-art VLMs (GPT-4.1, GPT-4.1-mini, Qwen-2.5-VL-72B, and Gemma-3-27B).
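
The sketch below shows how such a few-shot message sequence can be assembled for the detection task; file paths, label wording, and the reuse of the image-encoding approach from the API sketch above are assumptions.

```python
# Building a few-shot prompt: labeled example images followed by the unlabeled test image.
import base64

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

few_shot_examples = [
    ("examples/no_polyp.png", "Label: no polyp is present in this image."),
    ("examples/polyp.png", "Label: a polyp is present in this image."),
]

content = [{"type": "text", "text": "Here are labeled reference examples."}]
for path, label in few_shot_examples:
    content.append(image_part(path))
    content.append({"type": "text", "text": label})
content.append({"type": "text",
                "text": "Now classify the next image as 'polyp' or 'no polyp'."})
content.append(image_part("test/frame_042.png"))

messages = [{"role": "user", "content": content}]  # passed to the API payload as before
```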

Performance evaluation

We developed an approach that converts unstructured text into structured classifications using GPT-4 to facilitate the semi-automated evaluation of textual outputs. The model was configured with a temperature setting of 0 and with structured JSON output enabled.

The extraction system was designed to categorize VLM responses into predefined labels with explicit handling of uncertain or ambiguous cases. For polyp detection, the system classified responses into: (1) “Human evaluation needed: I am unsure,” (2) “Human evaluation needed: More than one diagnosis is selected, or no option is selected,” (3) “The unstructured answer selected: No polyp is detected in the image,” or (4) “The unstructured answer selected: A polyp is detected in the image.” For polyp classification, an additional category was included: (5) “The unstructured answer selected: The polyp type is classified as {polyp_type in polyp_types}.”

This structured extraction approach enables automated classification while flagging ambiguous or uncertain cases for human review, ensuring accuracy in the evaluation process. The system processes free-text responses by identifying key diagnostic terminology, matching it to predefined categories, and assigning confidence scores. Responses containing hedging language (“possibly,” “might be,” “unclear”) or multiple conflicting diagnoses were automatically flagged for human review.
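
A minimal sketch of this extraction step is shown below. The detection category strings come from the list above; the model snapshot, system-prompt wording, and JSON schema are assumptions.

```python
# GPT-4-based extraction of a free-text answer into one predefined category (temperature 0, JSON output).
import json
from openai import OpenAI

client = OpenAI()

DETECTION_CATEGORIES = [
    "Human evaluation needed: I am unsure",
    "Human evaluation needed: More than one diagnosis is selected, or no option is selected",
    "The unstructured answer selected: No polyp is detected in the image",
    "The unstructured answer selected: A polyp is detected in the image",
]

def extract_label(free_text_answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",   # assumed snapshot supporting JSON mode
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Map the answer to exactly one category from this list and return "
                        'JSON of the form {"category": "..."}: ' + json.dumps(DETECTION_CATEGORIES)},
            {"role": "user", "content": free_text_answer},
        ],
    )
    return json.loads(response.choices[0].message.content)
```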

To validate this approach, we manually reviewed a random sample of 50 response-extraction pairs. GPT-4 correctly extracted and labeled all 43 unambiguous responses while appropriately flagging 7 cases requiring human evaluation, demonstrating 100% accuracy for clear cases and appropriate conservative handling of ambiguous outputs. All flagged cases were subsequently reviewed by a clinical expert (PKM) to assign final labels.

Statistical analysis

We performed comprehensive statistical analysis using Python (version 3.11.5), employing standardized machine learning evaluation methodologies. We implemented the one-vs-all strategy for multiclass classification scenarios to enable binary performance metrics for each class. We selected the F1 score as our primary evaluation metric due to its balanced consideration of both precision and recall, making it particularly suitable for our dataset where class imbalance was present, especially in the polyp classification tasks where some pathology types had limited representation.

Performance was evaluated using multiple complementary metrics: F1 scores to balance precision and recall considerations; AUROC to assess discriminative capability; confusion matrices to visualize classification patterns and error types; and weighted metrics to account for class imbalance in overall performance assessment. For weighted F1 scores in polyp classification tasks, we calculated values based on the proportion of each polyp type in the test dataset, ensuring that performance metrics appropriately reflected the distribution of classes in clinical settings.
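
For concreteness, the core metrics can be computed with scikit-learn as sketched below; the class list ordering and the probability-matrix layout are assumptions.

```python
# Weighted F1, one-vs-all AUROC per class, and the confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

CLASSES = ["normal", "adenocarcinoma", "tubular adenoma", "tubulovillous adenoma",
           "villous adenoma", "hyperplastic polyp", "inflammatory polyp"]

def evaluate(y_true, y_pred, y_score):
    """y_score: (n_samples, n_classes) array of predicted class probabilities."""
    weighted_f1 = f1_score(y_true, y_pred, average="weighted")
    y_true_bin = label_binarize(y_true, classes=CLASSES)
    auroc_per_class = {
        c: roc_auc_score(y_true_bin[:, i], y_score[:, i])  # one-vs-all AUROC
        for i, c in enumerate(CLASSES)
    }
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    return weighted_f1, auroc_per_class, cm
```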

TiLense: importance of tiles for VLM zero-shot polyp detection

This proposed approach identifies and visualizes key image tiles in vision-language tasks by assessing the significance of each tile from response frequencies across multiple prediction attempts. Unlike more complex attribution methods, it relies on a single dominant answer rather than the model’s output probabilities. The procedure involves establishing the primary (base) answer, evaluating tile significance, and generating a heatmap of the important regions. This model-agnostic, unsupervised technique elucidates essential regions in VLM classification by comparing tile-masked results against the single base answer over N iterations. By highlighting areas where masking changes the answer, it reveals which sections of an image most influence model predictions, which is useful for model evaluation and improvement. We refer to this method as “TiLense” for its capacity to highlight importance across image tiles in zero-shot prediction tasks.

We implemented this tile masking technique to probe the vision capabilities of GPT-4.1 and GPT-4 in zero-shot prediction tasks across four scenarios: the presence of a polyp, a polyp against a challenging background, a standard image, and a standard image with a complex background. A systematic sliding window approach masked specific regions of the images (see Fig. 3a). The original and masked images were evaluated by each model using a standardized prompt, with a temperature setting of 1, a maximum token limit of 300, and no specified seed value, with the process repeated five times to create response distributions. The base answers were established through majority voting. The output is represented as a heatmap, where each tile is colored according to its impact on altering the base answer. Since tiles can overlap, we scale each tile from 0 to 1, coloring tiles from white to red.
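
A conceptual sketch of the TiLense procedure is given below. It reuses the hypothetical query_vlm helper from the API sketch above; a non-overlapping 3 × 3 grid is used here for simplicity, whereas the study's sliding window allows overlapping tiles, and the masking color, prompt wording, and normalization details are assumptions.

```python
# Conceptual TiLense sketch: mask each tile, re-query the model, and count answer flips.
from collections import Counter

import numpy as np
from PIL import Image

PROMPT = "Is a polyp present in this colonoscopy image? Answer yes or no."
N_RUNS, GRID = 5, 3  # five repetitions, 3x3 grid => 9 masked tiles (non-overlapping here)

def majority_answer(image_path: str) -> str:
    answers = [query_vlm(image_path, PROMPT) for _ in range(N_RUNS)]
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]

def tile_importance(image_path: str) -> np.ndarray:
    base = majority_answer(image_path)            # base answer via majority voting
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    scores = np.zeros((GRID, GRID))
    for r in range(GRID):
        for c in range(GRID):
            masked = img.copy()
            box = (c * w // GRID, r * h // GRID,
                   (c + 1) * w // GRID, (r + 1) * h // GRID)
            masked.paste((0, 0, 0), box)          # black out one tile
            masked.save("masked_tmp.png")
            for _ in range(N_RUNS):               # score = deviations from the base answer
                if query_vlm("masked_tmp.png", PROMPT).strip().lower() != base:
                    scores[r, c] += 1
    return scores / scores.max() if scores.max() else scores  # 0-1 scale for the white-to-red heatmap
```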

Libraries and local computing

For our VLM API calls, we used Python 3.11.5 in combination with the “requests” library to interact programmatically with the providers’ API endpoints. Local experiments for the CML models and ResNet50 training and testing were conducted on a laptop equipped with a Ryzen 7 4800H CPU and 16 GB of RAM, where we employed the scikit-learn and TensorFlow libraries for model implementation and evaluation.

Results

Model optimization

Our initial experimental phase focused on optimizing model configurations and prompt strategies for VLMs. We observed that domain-specific prompts consistently outperformed simple queries across all VLMs tested. For polyp detection, the smallest improvement was observed with Gemini-1.5-Pro (F1: from 0.715 to 0.731; +2.2%), while the largest gain was achieved by Qwen-2.5-vl-72b (F1: from 0.531 to 0.802; +51.0%). For polyp classification, the minimum improvement was seen in Claude-3-Opus (weighted F1: from 0.112 to 0.147; +31.2%), whereas the maximum improvement occurred with Qwen-2.5-vl-72b (weighted F1: from 0.008 to 0.502; +6175.0%). Table 5 provides a detailed comparison of performance improvements across prompting strategies, which formed the foundation for our subsequent analyses, and lists the prompt engineering techniques used. Supplementary Figures S1 and S2 present the confusion matrices of answers for polyp detection and classification, respectively.

Table 5 Impact of prompt engineering on vision-language model performance.

Polyp detection performance (CADe)

Polyp detection performance established a clear hierarchical distribution across models, as demonstrated by confusion matrices (Fig. 1) and F1 scores (Table 6). GPT-4.1 achieved the highest performance (F1: 91.98%), closely followed by ResNet50 (F1: 91.35%) and GPT-4.1-mini (F1: 91.16%), demonstrating that latest-generation VLMs can match task-specific CNNs for binary detection. BiomedCLIP demonstrated strong results (F1: 88.68%), outperforming general CLIP (F1: 68.39%) by more than 20 percentage points. Traditional machine learning and earlier VLMs formed the next tier: Random Forest and GPT-4 (both F1: 81.02%), SVM (F1: 77.92%), and Logistic Regression (F1: 72.80%). Moderate capability was observed for Decision Tree (F1: 68.10%), Qwen-2.5-vl-72b (F1: 68.59%), Gemma-3-27b (F1: 69.29%), and Claude-3-Opus (F1: 66.40%). The lowest detection capability was exhibited by Gemini-1.5-Pro (F1: 19.37%) and Gaussian Naive Bayes (F1: 10.22%). AUROC analysis (Fig. 2) reinforced these findings, with top performers achieving values above 0.95.

Fig. 1

Polyp detection performance across machine learning and vision language models. Confusion matrices depicting polyp detection performance across various models in test set (internal validation): classical machine learning algorithms—Decision Tree (a), Random Forest (b), Support Vector Machine (c), Logistic Regression (d), Gaussian Naive Bayes (e); convolutional neural network—ResNet-50 (f); vision-language models—GPT-4 (g), GPT-4.1 (h); GPT-4.1-mini (i), Claude-3-Opus (j), Gemini-1.5-Pro (k), Qwen-2.5-vl-72b (l), Gemma-3-27b (m); and contrastive vision-language encoders—CLIP (n), BiomedCLIP (o). Each matrix illustrates model predictions relative to ground-truth labels.

Table 6 Comparative analysis of machine learning models in polyp detection and classification. Performance comparison of classical machine learning (CML) models, ResNet-50, vision-language models (VLMs), and specialized VLMs for polyp detection and classification tasks.
Fig. 2

ROC curves and AUROC values for polyp detection. Receiver operating characteristic curves for polyp detection, with the corresponding AUROC values. AUROC values greater than 0.8 are shown in bold.

We applied the TiLense tile-based importance mapping method to elucidate model decision-making processes for GPT-4.1 and GPT-4. Figure 3 presents attention heatmaps across four diagnostically relevant scenarios: normal mucosa (3b), standard polyp (3c), poorly prepared normal mucosa (3d), and subtle polyp (3e). GPT-4.1 demonstrated clinically appropriate attention allocation, with high-importance tiles accurately localizing polyp regions in clear cases (3c) and maintaining focus on pathologically relevant features across varying image quality conditions. In contrast, GPT-4 exhibited attention misallocation in challenging scenarios, incorrectly prioritizing artifacts in poorly prepared images (3d) and displaying dispersed attention patterns for subtle lesions (3e), revealing susceptibility to image quality degradation and low-contrast pathology. These attention pattern differences align with the models’ respective classification accuracies, suggesting that GPT-4.1’s performance gains reflect improved capacity to focus on clinically meaningful anatomical features rather than confounding visual elements.

Fig. 3

Tile-level importance analysis of GPT-4.1 and GPT-4 polyp detection using TiLense. Evaluation of GPT-4.1 and GPT-4 for polyp detection using TiLense, focusing on tile-level importance. The method includes five runs with vision-language models (VLMs) on original and masked images, using 9 masked tiles per image. Each tile receives an importance score from 0 to 5, indicated by a color gradient from white to red, where red denotes a tile whose masking substantially alters the base answer. A reference answer for each image is established, and each deviation from it is scored as 1 point. The base answer was determined by majority voting among the five runs. Panel (a) illustrates the tile-masking procedure; panels (b–e) show tile-level predictions across image conditions: standard image without polyp (b), standard image with polyp (c), challenging image without polyp due to poor preparation (d), and challenging image with a hard-to-see polyp (e).

Polyp classification performance (CADx)

Classification performance revealed a different hierarchy than detection, with CNNs substantially outperforming VLMs for fine-grained histological discrimination (Fig. 4). ResNet50 achieved the highest weighted F1 (74.94%), establishing a 20-percentage-point advantage over the best VLMs: GPT-4.1-mini (55.07%) and GPT-4.1 (54.74%). SVM was the only other model exceeding 55% (55.63%). Mid-tier performers included Random Forest (43.67%), Qwen-2.5-vl-72b (42.13%), GPT-4 (41.18%), Logistic Regression (40.32%), and Decision Tree (40.42%). Earlier VLMs and contrastive encoders showed weaker performance: Gemma-3-27b (35.50%), BiomedCLIP (27.74%), Claude-3-Opus (25.54%), Gemini-1.5-Pro (6.17%), and CLIP (1.69%). Notably, BiomedCLIP’s strong detection (88.68%) did not translate to classification (27.74%), suggesting that zero-shot classification of subtle histological variants is substantially more challenging. Table 6 presents overall weighted F1 scores, while Supplementary Table S1 details performance by polyp type.

Fig. 4

Polyp classification performance of top-performing models. Confusion matrices of polyp classification are provided for the top-performing classical machine learning model (a: Random Forest), convolutional neural network (b: ResNet-50), highest-performing vision-language model (c: GPT-4.1), and the contrastive vision-language encoder fine-tuned on external general medical imaging data (d: BiomedCLIP). Abbreviations: AC, Adenocarcinoma; TA, Tubular Adenoma; TVA, Tubulovillous Adenoma; VA, Villous Adenoma; HP, Hyperplastic Polyp; IP, Inflammatory Polyp; No-A: No answer provided; 2OP: two options (polyp type) were selected.

Tubular adenoma (TA) images (650 training, 121 test) achieved the most consistent classification performance across models. The best results were obtained by ResNet50 (F1: 0.85), followed by Support Vector Machine (F1: 0.68) and Random Forest (F1: 0.64). Among VLMs, GPT-4 (F1: 0.58) outperformed Claude-3-Opus (F1: 0.33). However, other recent VLMs such as Gemma-3-27B (F1: 0.48) and Qwen-2.5-VL-72B (F1: 0.57) showed weaker performance. Notably, the latest multimodal models, GPT-4.1 (F1: 0.71) and GPT-4.1-mini (F1: 0.73), narrowed the gap with CNN and CML methods, underscoring rapid progress in VLM-based polyp subtype recognition.

Adenocarcinoma (AC) images (66 training, 13 test) were best classified by GPT-4.1-mini (F1: 0.69), closely followed by ResNet50 (F1: 0.67); GPT-4.1 (F1: 0.61) trailed both. Among other models, BiomedCLIP (F1: 0.56) and SVM (F1: 0.45) performed reasonably, while tree-based methods were low (Decision Tree: 0.06; Random Forest: 0.00). Other VLMs were modest: GPT-4 (F1: 0.30), Qwen-2.5-VL-72B (F1: 0.25), Gemma-3-27B (F1: 0.24), Claude-3-Opus (F1: 0.19), Gemini-1.5-Pro (F1: 0.00).

Hyperplastic polyp (HP) images (116 training, 22 test) presented a challenging classification task. Among CML methods, SVM (F1: 0.31) and Decision Tree (F1: 0.22) outperformed Random Forest (F1: 0.08), Logistic Regression (F1: 0.07), and Gaussian Naive Bayes (F1: 0.07). The CNN ResNet50 achieved the highest overall performance with an F1 of 0.49, highlighting the strength of deep learning for this subtype. VLMs generally performed poorly: GPT-4, GPT-4.1-mini, and Gemini-1.5-Pro all scored F1: 0.00, while GPT-4.1 (F1: 0.14), Claude-3-Opus (F1: 0.14), Qwen-2.5-vl-72b (F1: 0.09), and Gemma-3-27b (F1: 0.05) performed only slightly better. Among contrastive VLMs, BiomedCLIP (F1: 0.21) outperformed CLIP (F1: 0.04) but still lagged behind CNN and CML models.

The most challenging classifications were observed for tubulovillous adenoma (TVA, 48 training, 11 test), villous adenoma (VA, 30 training, 6 test), and inflammatory polyp (IP, 38 training, 7 test) images. For TVA, ResNet50 achieved the highest F1 of 0.55, with Decision Tree (F1: 0.27) and SVM (F1: 0.25) showing limited effectiveness. Most other models, including VLMs and contrastive VLMs, performed at or near random chance, except for BiomedCLIP (F1: 0.17), Gemma-3-27b (F1: 0.15), and GPT-4.1-mini (F1: 0.07), which provided small improvements. For VA, ResNet50 (F1: 0.25) was the only model with moderate performance; most other models failed, with minor gains from Claude-3-Opus (F1: 0.04), GPT-4.1 (F1: 0.20), and Qwen-2.5-vl-72b (F1: 0.09). For IP, ResNet50 (F1: 0.71) performed best, followed by SVM (F1: 0.36) and Logistic Regression (F1: 0.20), while most VLMs were ineffective, except GPT-4.1-mini (F1: 0.12) and GPT-4.1 (F1: 0.08); contrastive models CLIP and BiomedCLIP (F1: 0.04 each) contributed minimally.

Figure 4 displays confusion matrices for polyp classification utilizing Random Forest (CML’s top performer), ResNet50, GPT-4.1 (the leading VLM), and BiomedCLIP. Adenoma subtypes showed substantial confusion across all models, with tubulovillous and villous adenomas frequently misclassified as tubular adenomas. ResNet50 demonstrated the best discrimination but still showed considerable uncertainty. Complete ROC curves and confusion matrices for all models are in Supplementary Figures S3 and S4.

Polyp classification performance (CADx) on external validation dataset

External validation on 75 images from the PICCOLO database revealed varying performance degradation across model types. ResNet50 showed the largest decline (internal: 0.83, external: 0.49, Δ = -0.34), suggesting overfitting to institution-specific characteristics. VLMs demonstrated smaller drops: GPT-4.1-mini (0.75 to 0.59, Δ = -0.16), GPT-4.1 (0.72 to 0.58, Δ = -0.14), and Gemma-3-27B (0.72 to 0.53, Δ = -0.19). Notably, Qwen-2.5-vl-72B exhibited the smallest decline among high-performing models (0.66 to 0.61, Δ = -0.05), suggesting superior cross-institutional generalization. CML models showed intermediate degradation: SVM (0.69 to 0.52, Δ = -0.17), Logistic Regression (0.59 to 0.48, Δ = -0.11), Random Forest (0.63 to 0.53, Δ = -0.10), and Decision Tree (0.55 to 0.53, Δ = -0.02). Gaussian Naive Bayes showed apparent improvement (0.08 to 0.12, Δ = +0.04), likely reflecting statistical noise given its poor baseline. These results suggest that while CNN achieves superior internal performance, pretrained VLMs may offer generalization advantages. F1 scores are presented in Table 7, with confusion matrices provided in Supplementary Figure S5.

Table 7 Comparative analysis of machine learning models in polyp classification on the external dataset. Performance comparison of classical machine learning (CML) models, ResNet-50, and vision-language models (VLMs) for polyp classification tasks.

Exploring few-shot injection impact on VLM prediction

Few-shot prompting produced heterogeneous effects for polyp detection (F1 scores in Table 6; confusion matrices in Supplementary Figure S6). Gemma-3-27B showed the largest improvement (F1: 0.69 to 0.81), followed by Qwen-2.5-VL-72B (F1: 0.69 to 0.75). GPT-4.1 exhibited only a marginal gain (F1: 0.92 to 0.93), suggesting near-optimal baseline performance, while GPT-4.1-mini experienced a slight decline (F1: 0.91 to 0.89).

Few-shot prompting also produced mixed effects on classification performance across models. While overall weighted F1 often declined (GPT-4.1: 0.55 to 0.43, GPT-4.1-mini: 0.55 to 0.49, Qwen-2.5-vl-72b: 0.42 to 0.36), certain underrepresented categories benefited substantially. For example, GPT-4.1-mini improved HP classification F1 score from 0.00 to 0.30, and Qwen-2.5-vl-72b increased AC from 0.25 to 0.35 and VA from 0.00 to 0.13. Gemma-3-27b also demonstrated consistent gains, raising weighted F1 from 0.36 to 0.38, with HP classification F1 score improving from 0.05 to 0.17 and TVA from 0.15 to 0.22. However, these improvements were often offset by declines in high-prevalence classes such as AC and TA (e.g., GPT-4.1 F1 score for AC: 0.61 to 0.47, TA: 0.71 to 0.52). This trade-off suggests few-shot learning requires careful calibration, as improvements for rare classes may come at the cost of common category accuracy.

Discussion

Our systematic evaluation established a performance hierarchy across computational paradigms. For polyp detection, the highest-performing zero-shot VLMs achieved parity with a task-specific CNN. GPT-4.1 (F1: 91.98%) and GPT-4.1-mini (91.16%) performed comparably to ResNet50 (91.35%), demonstrating that frontier VLM architectures can match specialized CNNs for binary classification tasks. The 11-percentage-point improvement from GPT-4 (81.02%) to GPT-4.1 within a single model generation suggests rapid architectural evolution, though proprietary models preclude definitive attribution. However, this performance was not universal across VLMs. Qwen-2.5-vl-72b (68.59%), Gemma-3-27b (69.29%), Claude-3-Opus (66.40%), and Gemini-1.5-Pro (19.37%) performed substantially worse, with some scoring at or below CML baselines (Random Forest: 81.02%, SVM: 77.92%). This 72-point performance range across VLMs (GPT-4.1: 91.98% to Gemini-1.5-Pro: 19.37%) underscores that the VLM label does not denote uniform capability, but rather encompasses architectures with markedly different medical imaging performance.

For polyp classification, even the highest-performing VLMs underperformed the CNN. ResNet50 (weighted F1: 74.94%) substantially outperformed GPT-4.1-mini (55.07%), the best VLM for this task. This 20-point performance gap widened substantially for rare polyp subtypes, as detailed below. CML approaches consistently underperformed deep learning methods for both detection and classification, validating the shift toward neural architectures in medical imaging.

This detection-classification dichotomy likely reflects fundamental task differences. Polyp detection requires distinguishing abnormal mucosal protrusions from normal tissue based on features such as texture variations, color changes, and surface irregularities visible during endoscopy. VLMs’ broad pretraining on diverse visual domains may enable recognition of these general visual patterns. In contrast, polyp classification requires discrimination between subtle morphological variants visible on the polyp surface. Distinguishing different polyp classes based on colonoscopy images probably requires recognition of surface pit patterns, vascular patterns, color variations, shape characteristics, and surface texture that correlate with underlying histology28,29. These domain-specific visual-histological correlations, likely absent from general pretraining datasets, may explain why VLMs struggle with fine-grained histological prediction despite achieving strong detection performance.

Performance on rare polyp types revealed the magnitude of this classification limitation. For TA (650 training images, 121 test images), GPT-4.1 and GPT-4.1-mini achieved 71–73% F1 for endoscopic histological prediction. However, performance declined substantially for rarer subtypes: for VA (30 training, 6 test) both models scored ≤ 20% F1, for TVA (48 training, 11 test) ≤ 7% F1, and for IP (38 training, 7 test) ≤ 12% F1. For HP (116 training, 22 test), GPT-4.1-mini scored 0% F1 and GPT-4.1 only 14%. In contrast, ResNet50 maintained non-zero performance across all categories: HP 49%, VA 25%, TVA 55%, IP 71%. Even CML models (SVM: 31% for HP) outperformed the leading VLMs on these categories. This pattern extends beyond simple class imbalance, as classical models trained on the same limited rare examples maintained non-zero performance. The findings suggest that zero-shot transfer, while effective for common polyp types with abundant visual similarity to general pretraining data, fails for rare histological presentations requiring domain-specific pattern recognition.

The substantial performance variability across VLMs noted above warrants investigation. These findings are consistent with emerging evidence from other clinical domains showing wide variability in VLM performance across medical imaging tasks30,31,32,33,34,35. Several factors likely contribute. First, architectural differences across proprietary models affect visual-language integration. GPT-4.1-mini achieving nearly identical detection performance (91.16%) to GPT-4.1 (91.98%) despite presumably fewer parameters suggests architectural innovations rather than scale drive improvements. Second, pretraining data composition varies. BiomedCLIP (88.68% F1) substantially outperformed general CLIP (68.39%) for polyp detection as a result of its additional training on 15 million biomedical figure-caption pairs from PubMed Central11, providing direct evidence that medical content exposure improves performance. General-purpose VLMs likely contain varying amounts of incidental medical imaging in their pretraining corpora, partially explaining performance differences. Third, instruction-following capability varies substantially, as demonstrated by our prompt engineering experiments.

Prompt engineering revealed substantial performance sensitivity. For polyp detection, improvements with engineered prompts ranged from 2.2% (GPT-4.1, Gemini-1.5-Pro) to 51.0% (Qwen-2.5-vl-72b). For classification, improvements were substantial: GPT-4.1 (15.6% to 59.4%, + 280.7%), GPT-4.1-mini (16.9% to 71.1%, + 320.7%), and Qwen-2.5-vl-72b (0.8% to 50.2%, + 6175%). These magnitudes underscore that systematic prompt design is critical for medical VLM deployment17,36. Few-shot prompting showed variable effects. For detection, Gemma-3-27B improved substantially (+ 17.4%) while GPT-4.1 showed minimal gain (+ 1.1%), consistent with baseline performance near ceiling. GPT-4.1-mini declined slightly (-2.2%), suggesting few-shot examples may introduce noise for high-performing models. This outcome may also be attributed to our selection of examples: we primarily included clear and unambiguous cases that the model could process effectively, whereas its performance may decline when confronted with more ambiguous images. For classification, few-shot prompting often improved rare categories while reducing common category performance, yielding limited overall gains. Our results exceed previously reported prompt-dependent performance variations and reinforce that effective prompt engineering is critical for clinical VLM implementation17,36. In addition, these findings reaffirm that prompt optimization benefits mid-performing models most, while top performers show diminishing returns37,38,39.

Beyond internal performance patterns observed in our test set, cross-institutional generalization represents a critical consideration for clinical deployment. External validation on 75 images from the PICCOLO database assessed cross-institutional generalization. ResNet50 showed substantial performance decline (weighted F1: 0.83 to 0.49), potentially reflecting overfitting to institution-specific characteristics such as imaging equipment settings, acquisition protocols, or patient population differences. VLMs also experienced decreases, with GPT-4.1 (0.72 to 0.58), GPT-4.1-mini (0.75 to 0.59), and Gemma-3-27B (0.72 to 0.53) showing larger declines than Qwen-2.5-vl-72B (0.66 to 0.61). The relatively stable performance of some VLMs compared to ResNet50’s larger degradation may suggest that zero-shot models pretrained on diverse data possess some cross-domain robustness. However, our limited external sample (75 images, one institution, three polyp classes versus six in internal data) precludes definitive conclusions.

These performance characteristics, together with fundamental differences in computational requirements, have direct implications for clinical deployment strategies. Computational requirements differ fundamentally between model families with direct implications for clinical applicability. CNNs require dataset annotation, model training (several hours on our hardware for ResNet50), and validation testing. However, once deployed, CNNs enable rapid local inference (milliseconds per image on CPU) with zero recurring costs and no network dependencies. This computational profile makes CNNs suitable for real-time intra-procedural applications, where frame-by-frame analysis during endoscope advancement can provide immediate feedback to endoscopists. VLMs eliminate training requirements through zero-shot deployment, substantially reducing barriers to entry. However, current API-based VLMs introduce per-image costs and network latency (seconds per image in our implementation), making them unsuitable for real-time use during live procedures. Network dependencies also introduce reliability concerns. The computational profile of current API-based VLMs restricts them to retrospective applications such as post-procedure quality assurance, batch analysis of stored images, or second-opinion consultation on challenging cases.

These computational constraints shape institutional deployment decisions. Academic centers with AI infrastructure may favor CNN development for real-time applications despite upfront costs, benefiting from zero marginal inference costs and real-time deployment capability for both detection and optical diagnosis. Community practices lacking machine learning expertise might find API-based VLMs useful for retrospective quality assurance despite recurring costs, as zero-training deployment enables immediate adoption for post-procedure review. However, institutions seeking real-time procedural guidance must pursue CNN-based approaches given current technological constraints. The substantial performance gap for rare polyp classification further indicates that current-generation VLMs should not be relied upon for optical diagnosis decisions without further technological advancement.

Several immediate research directions emerge from these findings. First, evaluation on video colonoscopy sequences rather than still frames would assess temporal reasoning capabilities and enable analysis of dynamic polyp characteristics across multiple viewing angles. Second, expansion of external validation to additional institutions with diverse endoscopy equipment, patient populations, and polyps would better characterize cross-institutional generalization and identify specific factors affecting model transferability. Third, investigation of spatial localization capabilities, particularly for VLMs through region-specific prompting or coordinate generation, would address a critical requirement for clinical applicability. Fourth, our choice of examples for few-shot prompting may have influenced the results; therefore, future studies should explore alternative methods for example selection. Finally, systematic analysis of model performance stratified by polyp size, morphology, and location would reveal potential biases affecting clinical safety and identify subgroups requiring targeted algorithmic improvements.

Several methodological limitations should be considered. First, natural prevalence disparities influenced our dataset composition despite our augmentation efforts, potentially impacting model performance for several rare polyp categories. Second, our evaluation used still colonoscopy images rather than video sequences, eliminating temporal continuity, polyp motion tracking, and multi-angle visualization available during actual procedures. Third, our study focuses on polyp detection (presence/absence) and classification (histological type) rather than spatial localization, which would be necessary for complete clinical implementation. Fourth, our external validation provides initial cross-institutional evidence but represents a small sample from a single additional institution with three polyp classes compared to our internal dataset’s six classes. Larger-scale multi-institutional validation is necessary to establish robust generalizability benchmarks.

Conclusion

This systematic comparison of VLM and CNN for colonoscopy polyp analysis reveals a clear task-dependent performance hierarchy. While the highest-performing VLMs matched CNNs for binary polyp detection, CNNs maintained substantial advantages for polyp classification, particularly for rare polyp subtypes where VLMs failed entirely. These findings suggest that current zero-shot VLMs may serve retrospective quality assurance roles but remain unsuitable for real-time clinical deployment requiring histological discrimination. Computational constraints further restrict API-based VLMs to post-procedure applications, while CNNs enable real-time intra-procedural guidance. As both architectural families continue to evolve, understanding their complementary strengths and limitations will inform appropriate deployment strategies across diverse clinical settings.