Introduction

Headaches are a common condition affecting a substantial portion of the global population1. Early detection and accurate diagnosis are essential for effective management and treatment. With advancements in modern imaging techniques and artificial intelligence (AI) technologies, it is increasingly feasible to develop automated headache detection systems that support clinicians in making precise diagnoses. In automated headache detection, supervised machine learning algorithms have been increasingly utilized to identify and classify various types of headaches, including migraine2. These methods rely on labeled (annotated) datasets to learn the complex patterns that distinguish headache conditions from normal cases. By integrating diverse patient data, including clinical records, machine learning approaches can directly associate patient characteristics with headache diagnoses. Messina and Filippi3 reviewed how machine learning methodologies have been used in headache research. Primarily for migraine, machine learning algorithms have leveraged patient-provided information along with brain imaging data to differentiate those with migraine from healthy controls (HC) and to subtype those with migraine according to differing symptoms4,5.

Deep learning techniques have further advanced this field by enabling comprehensive analysis of whole-brain imaging data, such as MRI scans. Such methods are now widely used for tasks such as brain tumor detection6,7, anomaly detection8,9, and metastases detection10. Convolutional Neural Networks (CNNs)11, including architectures such as ResNets, have been used successfully with multimodal MRI data for migraine classification11,12. Recently, Siddiquee et al.5 showed that a ResNet-18-based 3D classifier achieved promising performance in classifying images into categories such as HC, migraine (Mig), acute post-traumatic headache (APTH), and persistent post-traumatic headache (PPTH). Nevertheless, a significant challenge in developing such systems is the limited availability of imaging data and annotations, particularly for less common headache subtypes.

Conventional supervised deep learning methods require large quantities of labeled data to guide model training. Because gathering extensive annotated medical data is both time-intensive and costly, alternative approaches have been developed. Alzubaidi et al.13 presented a comprehensive list of tools to address data scarcity in deep learning, including transfer learning, self-supervised learning, deep synthetic minority oversampling, and other techniques. Transfer learning leverages pre-trained models and fine-tunes them on available medical data, substantially reducing the need for large-scale annotations. Self-supervised learning, in turn, trains models on unlabeled data by solving auxiliary tasks, which helps the model learn useful representations without large labeled datasets. Both approaches can alleviate the challenges of limited annotated data by reducing reliance on large labeled datasets. However, in medical imaging, traditional pre-trained models such as those developed on ImageNet14 are often suboptimal due to domain differences (ImageNet consists of natural images). A model pre-trained on medical images may provide advantages for medical tasks with significantly fewer samples than one pre-trained on ImageNet15. Yet a significant challenge in developing medical imaging pre-trained models is the lack of large volumes of annotated medical images. Training Vision Transformers (ViTs) from scratch on limited data is known to result in poor generalization, largely due to their lack of inherent inductive biases such as spatial locality and hierarchical feature representation16. An emerging idea is to leverage text and images in a multimodal fashion and construct a contrastive learning problem that is annotation-free.

CLIP17 (Contrastive Language-Image Pretraining) is a multimodal AI model developed by OpenAI that can understand images and text together. Models pre-trained with CLIP can thus serve as foundation models for tasks with smaller datasets, without the need for annotations. In this study, we explored using a pre-trained CLIP-based model for headache classification given access to a relatively limited number of brain MRIs. The CLIP model comprises an image encoder, implemented as a Vision Transformer (ViT)18, and a text transformer19 encoder, trained jointly on image-text pairs using a contrastive learning approach. We utilize BioMedCLIP20, a foundation model based on the CLIP framework that incorporates ViT-B (base ViT model) as the image encoder, with ~86 million parameters, and PubMedBERT21 as the text encoder. As its name suggests, this foundation model is specifically trained for biomedical applications. It is pre-trained on the PMC-15M dataset, which contains 15 million biomedical image-text pairs, using the CLIP framework with contrastive learning. While this foundation model was first released as a preprint in March 2023, it was published in the New England Journal of Medicine AI in December 2024. During this period, researchers explored the advantages of contrastive language-image pretraining in medical applications, including hematological image classification22 and differentiating between normal and malignant cells in digital pathology23. Notably, these applications make use of both images and text.
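For context, the contrastive objective underlying CLIP-style pre-training can be written compactly. The sketch below is ours, not BioMedCLIP's released code; the temperature value is an illustrative default, and in this study the image encoder is later fine-tuned with plain cross-entropy rather than this loss.

```python
# Minimal sketch of the symmetric CLIP-style contrastive objective
# (for context only; assumed temperature, not BioMedCLIP's exact code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)        # unit-norm image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)        # unit-norm text embeddings
    logits = img_emb @ txt_emb.t() / temperature  # pairwise similarity matrix
    targets = torch.arange(logits.size(0))        # matching pairs on the diagonal
    # Symmetric loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```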

Although BioMedCLIP is a multimodal model combining a ViT and PubMedBERT, in this study we use only the ViT image encoder component for MRI-based headache classification, due to the absence of textual annotations in MRI datasets. Our study therefore utilizes BiomedCLIP’s image encoder independently, fine-tuning it on our institutional brain MRI dataset. The image encoder of this foundation model captures intricate biomedical patterns from extensive pre-training on millions of image-text pairs, and these learned representations inherently carry anatomical insights beneficial for diagnostic tasks even without explicit textual data. Exclusively leveraging the pre-trained image encoder thus maximizes transferable biomedical knowledge. With the pre-trained image encoder, we can effectively fine-tune the model on our relatively small dataset of T1-weighted brain MRIs for headache classification. This strategy not only circumvents the limitations of scarce annotated data but also harnesses the deep, contextual understanding embedded in the model, thereby enhancing performance in specialized diagnostic applications.
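As a concrete illustration, the following sketch loads only the image encoder and attaches a fresh binary classification head. It assumes the open_clip_torch package and the publicly released BiomedCLIP checkpoint on Hugging Face; it is a plausible setup, not the authors' exact code.

```python
# Hedged sketch: load BiomedCLIP, keep only the ViT image encoder, and
# attach a new linear head for binary headache classification. Assumes
# the open_clip_torch package and the public Hugging Face checkpoint.
import torch
import open_clip

model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
image_encoder = model.visual      # ViT-B trunk; the text tower is discarded
image_encoder.train()             # unfreeze all weights for full fine-tuning

# Probe the embedding size with a dummy image so the head matches the
# encoder's output dimensionality.
with torch.no_grad():
    embed_dim = image_encoder(torch.zeros(1, 3, 224, 224)).shape[-1]
head = torch.nn.Linear(embed_dim, 2)   # 2 classes: HC vs. one headache type
```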

Our dataset consists of individuals with three headache types (migraine, APTH, and PPTH) and HC. Migraine, a primary headache type, affects approximately 14% of the general population24,25 and is a leading cause of disability; it is one of the most common headache types among patients seeking care in outpatient clinics. Post-traumatic headache (PTH; here APTH and PPTH), a secondary headache type, is a headache attributed to a traumatic injury to the head. In this study, we included participants who had PTH due to mild traumatic brain injury (mTBI) and whose headache had persisted for 3 months or less (APTH) or for longer than 3 months (PPTH). We fine-tuned the BioMedCLIP model on an MRI dataset of 721 subjects, comprising 424 from the public IXI dataset and 297 participants imaged at Mayo Clinic spanning migraine, APTH, PPTH, and HC. Specifically, we used only the pre-trained image encoder, a ViT model, from the foundation model and fine-tuned it with cross-entropy loss to adapt it for headache classification using 3D MRI data. We propose a novel slice-based multiple-instance learning evaluation strategy to assess our model’s performance. We employed Grad-CAM26 (Gradient-weighted Class Activation Mapping), a model explanation technique, to identify the brain regions associated with different headache types; this enabled us to extract critical biomarkers that contribute to the model’s classification decisions for migraine, APTH, and PPTH. One reason we included 424 IXI subjects in this study is to support a comparison with the published work of Siddiquee et al.5 and thereby demonstrate the advantages of the foundation model. We show that the pre-trained model with fine-tuning considerably outperforms models trained from scratch by comparing the classification metrics and biomarkers derived from the two approaches. In summary, we make the following contributions in this study:

  • We propose a method to fine-tune the pre-trained image encoder (ViT) from the multimodal foundation model for headache classification using structural brain MRI. By leveraging its prior training on millions of biomedical image–text pairs, we transfer rich domain-specific representations to a single-modality MRI task, despite the absence of paired text annotations.

  • Our proposed model demonstrates superior and consistent accuracy across multiple cross-validation folds compared to a model trained from scratch on the same imaging data.

  • We identify potential biomarkers by extracting and ranking brain regions significantly influencing the model’s predictions, thereby providing insights into neuroanatomical correlations of headache phenotypes.

Materials and methods

Dataset

We collected T1-weighted structural MRI data from 96 individuals with migraine, 48 with APTH, and 49 with PPTH, diagnosed in accordance with the diagnostic criteria of the International Classification of Headache Disorders (ICHD)27,28. Participants with APTH and PPTH were enrolled at Mayo Clinic Arizona or the Phoenix Veterans Administration, but all participants had MRI at Mayo Clinic Arizona. Additionally, we included 104 HC from the Mayo Clinic and extended our dataset by incorporating MRI scans of 424 HC from the publicly available IXI dataset29. Participant information for each dataset is summarized in Table 1.

Table 1 Headache dataset details.

Institutional data set: participant enrollment and characteristics

This study was approved by the Mayo Clinic Institutional Review Board (IRB), the Phoenix Veterans Administration IRB, and the United States Department of Defense Human Research Protection Office, and all participants provided written informed consent for their participation. The authors declare that all the methods in this article were performed in accordance with the relevant guidelines and regulations in the editorial and publishing policies of Scientific Reports (https://www.nature.com/srep/journal-policies/editorial-policies#experimental-subjects). At the time of enrollment, migraine participants were diagnosed with episodic or chronic migraine, with or without aura, based on the most recent edition of the International Classification of Headache Disorders (ICHD-3 beta or ICHD-3). Participants with APTH or PPTH had PTH attributed to mTBI according to the latest ICHD criteria (ICHD-3 beta or ICHD-3). Individuals with a history of moderate or severe traumatic brain injury were excluded from the study. Participants with APTH were enrolled between 0 and 59 days following mTBI, while those with PPTH were enrolled at any point after they had PTH for longer than three months. HC were excluded if they had a history of TBI or any headache type other than infrequent tension-type headache.

Image acquisition was performed using 3 Tesla Siemens scanners (Siemens Magnetom Skyra, Erlangen, Germany) equipped with a 20-channel head and neck coil. Anatomical T1-weighted images were captured using magnetization-prepared rapid gradient echo (MPRAGE) sequences. The imaging parameters for acquiring T1-weighted images were as follows: repetition time (TR) = 2400 ms, echo time (TE) = 3.03 ms, flip angle (FA) = 8°, and voxel size = 1 × 1 × 1.25 mm³.

The participants with migraine had a mean age of 39.9 years (± 11.6) and 74.7% were female. They had a mean headache frequency of 15.3 days per month; 37 had episodic migraine and 59 had chronic migraine. Additionally, 49 patients reported experiencing aura with at least some of their migraine attacks. The PTH group included 48 participants with APTH, with a mean age of 41.6 years (± 12.7); 60.4% were female. Individuals with APTH experienced their first PTH symptoms an average of 24.5 days (± 14.5) prior to imaging. They reported headaches on 76.3% (± 29.6%) of days following the onset of APTH. Their mTBIs were due to motor vehicle accidents (n = 20), falls (n = 21), and direct hits to the head (n = 7). The dataset also included 49 patients with PPTH, with a mean age of 38.1 years (± 10.5); 34.7% were female. PPTH participants experienced headaches on an average of 15.3 days (± 7.4) per month and had experienced PPTH for an average of 10.8 (± 8) years. The mTBIs leading to PPTH were attributed to various causes, including blast injuries (n = 22), falls (n = 12), sports-related injuries (n = 8), and motor vehicle accidents (n = 7).

Public IXI data set: participant enrollment and characteristics

The IXI dataset, part of the “Information eXtraction from Images” project (EPSRC GR/S21533/02)29, is a comprehensive compilation of MRIs from healthy individuals. It was acquired between June 2005 and December 2006 to facilitate medical imaging and neuroimaging research. In addition to magnetic resonance angiography (MRA) images and diffusion-weighted images (DWI) with 15 gradient orientations, the dataset comprises T1-, T2-, and proton density (PD)-weighted images. The data were collected at three hospitals in London, each of which utilized a distinct MRI system. Hammersmith Hospital utilized a Philips 3T scanner (TR = 9.6 ms, TE = 4.6 ms, acquisition matrix = 208 × 208, FA = 8°), Guy’s Hospital employed a Philips 1.5T scanner (TR = 9.813 ms, TE = 4.603 ms, FA = 8°), and the Institute of Psychiatry utilized a GE 1.5T scanner. The images are supplied in the Neuroimaging Informatics Technology Initiative (NIfTI) format, which simplifies their use in a variety of imaging software. In total, 277 male and 342 female participants were enrolled. Our final cohort contains 194 male and 230 female subjects with an average age of 42.4 years (± 13.0), matching the average age of the Mayo Clinic dataset. In total, the dataset includes 721 subjects: 424 HC from IXI and 297 local subjects (96 migraine, 48 APTH, 49 PPTH, 104 HC).

Image preprocessing

The preprocessing pipeline for the brain MRI dataset involved several standardized image processing steps to enhance data uniformity and facilitate accurate analysis. Images obtained from the Mayo Clinic were collected in DICOM format and later converted to NIfTI format. Initially, brain extraction was carried out using the Brain Extraction Tool (BET)30 from the FMRIB Software Library (FSL; Wellcome Centre, University of Oxford, UK). BET efficiently isolates cerebral structures by removing extraneous non-brain tissues, resulting in skull-stripped MR images. This critical step normalizes image content across the combined datasets, effectively mitigating biases from varying scanners, acquisition protocols, and batch effects in both the public IXI and institutional datasets. Next, linear image registration was performed using FMRIB’s Linear Image Registration Tool (FLIRT)31, also from FSL, to align individual MR images to the Montreal Neurological Institute (MNI) standard brain atlas (MNI152, 1 mm). This normalization step applies linear affine transformations to match the anatomical features and spatial orientation of each subject’s brain to the standardized template, correcting for variations in brain size, shape, and orientation and enabling robust inter-subject comparisons. FreeSurfer was subsequently used to perform white matter parcellation (wmparc), generating parcellation masks while excluding 14 irrelevant regions: the left and right vessels, left and right lateral ventricles, left and right unsegmented white matter, left and right choroid plexus, left and right inferior lateral ventricles, third and fourth ventricles, cerebrospinal fluid (CSF), and optic chiasm. Finally, to reduce computational complexity and facilitate transfer learning with the ViT, which is optimized for 2D image inputs, the 3D MRIs were converted into 2D sagittal slices for model fine-tuning and evaluation. The resulting preprocessed images, standardized through these steps, provided a robust dataset suitable for fine-tuning the pre-trained BioMedCLIP ViT model for headache classification.
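The chain of preprocessing steps can be summarized in code. The sketch below is a simplified rendition under assumed default parameters (BET defaults, a 12-DOF affine for FLIRT); it uses the FSL command-line tools plus nibabel for slicing, and the template path is a placeholder.

```python
# Simplified sketch of the preprocessing chain: FSL BET skull stripping,
# FLIRT affine registration to the MNI152 1 mm template, then extraction
# of 2D sagittal slices with nibabel. Paths and parameters are assumptions.
import subprocess
import nibabel as nib

MNI_TEMPLATE = "/usr/local/fsl/data/standard/MNI152_T1_1mm_brain"  # placeholder path

def preprocess_t1(t1_path, out_prefix):
    stripped = f"{out_prefix}_brain.nii.gz"
    registered = f"{out_prefix}_mni.nii.gz"
    subprocess.run(["bet", t1_path, stripped], check=True)          # skull strip
    subprocess.run(["flirt", "-in", stripped, "-ref", MNI_TEMPLATE,
                    "-out", registered, "-omat", f"{out_prefix}.mat",
                    "-dof", "12"], check=True)                      # linear affine
    vol = nib.load(registered).get_fdata()
    # In MNI space the first axis runs left-right, so indexing along it
    # yields sagittal slices.
    return [vol[i, :, :] for i in range(vol.shape[0])]
```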

Headache classification and automatic biomarker extraction

Our method involves fine-tuning the pre-trained ViT image encoder from the BioMedCLIP foundation model, which was originally trained within a contrastive learning framework that leverages both image and text embeddings from their respective encoders. The model learns contrastively by maximizing the similarity score of each matching image-text pair in the dataset and minimizing the similarity scores of unpaired image-text combinations. This process is illustrated in Fig. 1. In this study, we took the pre-trained Vision Transformer image encoder from the foundation model and fine-tuned it for the headache classification task, using a cross-entropy loss function in a supervised learning framework. We fine-tuned the ViT model for three binary classification tasks: HC vs. Migraine, HC vs. APTH, and HC vs. PPTH, with the goal of extracting subtyping biomarkers. To prepare the data for fine-tuning and evaluation, we preprocessed the 3D MRI scans by slicing them into 2D sagittal slices for each patient; a sketch of this supervised fine-tuning step follows Fig. 1.

Fig. 1
figure 1

Overview of the BioMedCLIP pre-training framework, which uses a contrastive learning approach. A vision transformer (ViT) image encoder is trained jointly with a parallel text encoder. During training, image and text embeddings are generated, and the similarity scores between matching image-text pairs (the diagonal) are maximized, while the similarity scores of non-matching pairs (off-diagonal) are minimized. This strategy aligns the visual and textual representations. The image included in this figure is for illustrative purposes only and was not part of this study.
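A minimal sketch of the supervised fine-tuning step referenced above, continuing the encoder-loading sketch from the Introduction; the optimizer choice, learning rate, and `slice_loader` are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of supervised fine-tuning with cross-entropy on 2D sagittal
# slices. `image_encoder` and `head` come from the earlier loading sketch;
# `slice_loader` is a placeholder DataLoader of (slice, label) batches.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(
    list(image_encoder.parameters()) + list(head.parameters()), lr=1e-5
)
for images, labels in slice_loader:        # labels: 0 = HC, 1 = headache type
    logits = head(image_encoder(images))   # slice-level class logits
    loss = F.cross_entropy(logits, labels) # supervised objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```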

The dataset was randomly divided into three subsets: training, validation, and blind testing (Table 2). The validation set was used to identify the best-performing deep learning model, while the blind testing set was used to evaluate the robustness and generalization of the model on unseen data. We performed five-fold cross-validation for each task and report the average and standard deviation of the results. For each fold, we maintained the same number of patients for training, validation, and testing as specified in Table 2; each fold used different test and validation samples, ensuring no overlap between folds during model evaluation. For evaluation, we followed a multiple-instance learning32,33 scheme, a variation of supervised learning in which multiple patches of a subject’s image are treated as a bag of instances sharing the same label. We employed a slice-based multiple-instance learning evaluation in which slice-level embeddings are extracted for each patient: each slice of the 3D MRI is passed through the image encoder, the resulting embeddings are converted into probabilities indicating the likelihood of being healthy or having a headache disorder, and the slice-level scores are averaged to produce patient-level classification predictions (see the sketch following Table 2). Notably, while the training set is imbalanced, we applied oversampling to compensate. The test set was kept balanced to facilitate a robust evaluation with five-fold cross-validation. In addition to accuracy, we report precision, specificity, and sensitivity to provide a comprehensive assessment of model performance and account for class imbalance.

Table 2 Data split details for different classification tasks.
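The slice-based multiple-instance aggregation described above reduces to averaging slice-level probabilities. The hedged sketch below reuses `image_encoder` and `head` from the earlier sketches; tensor shapes are illustrative.

```python
# Sketch of slice-based multiple-instance evaluation: every sagittal slice
# of a subject is scored independently, softmax probabilities are averaged
# over the bag of slices, and the subject inherits the pooled decision.
import torch

@torch.no_grad()
def predict_patient(volume_slices):            # [n_slices, 3, 224, 224]
    logits = head(image_encoder(volume_slices))
    probs = logits.softmax(dim=-1)             # per-slice class probabilities
    patient_prob = probs.mean(dim=0)           # average across the bag
    return patient_prob.argmax().item(), patient_prob
```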

Once the classifier was trained for each task, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) to the MRI images to identify brain regions associated with the different headache phenotypes. Grad-CAM was applied to the final attention block of the fine-tuned ViT model, and the resulting activation maps were aggregated across slices and projected onto anatomical regions to compute regional importance scores. For each patient, we first applied Grad-CAM to generate an activation map highlighting the spatial locations contributing most to the classification decision. Next, we mapped the activation scores onto predefined brain regions by taking the maximum Grad-CAM activation within each region, effectively assigning each region a representative activation value. We repeated this step for all patients in our dataset and averaged these regional activation scores across patients to obtain patient-level consensus activation maps. We executed this procedure separately for each of the five cross-validation folds, deriving regional activation maps from five independently trained models, and finally computed the average regional activation across the five folds, ensuring robust and stable identification of regions consistently implicated in headache classification. The regions were then ranked by their averaged activation scores, and the top activated regions were selected as the most salient biomarkers associated with headache conditions. The overall process is visualized in Fig. 2, and a sketch of the region-scoring step follows the figure.

Fig. 2
figure 2

Overview of fine-tuning and biomarker extraction pipeline using the pre-trained ViT from BiomedCLIP. APTH = acute post-traumatic headache; HC = healthy control; Mig = migraine; PPTH = persistent post-traumatic headache.
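The region-scoring step can be sketched as follows. Here `cam_vol` is a per-subject Grad-CAM activation volume assumed to be resampled onto the same grid as the FreeSurfer wmparc labels; the names and label bookkeeping are illustrative, not the authors' exact implementation.

```python
# Illustrative sketch of regional Grad-CAM scoring: take the maximum
# activation inside each wmparc region; afterwards (outside this function)
# region scores are averaged across patients and across the five folds
# before ranking. `excluded` holds the labels of the 14 excluded regions.
import numpy as np

def regional_scores(cam_vol, wmparc, excluded):
    scores = {}
    for label in np.unique(wmparc):
        if label == 0 or label in excluded:    # skip background / excluded
            continue
        scores[int(label)] = float(cam_vol[wmparc == label].max())
    return scores                              # region label -> max activation
```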

Results

The fine-tuned model’s performance and the corresponding brain regions identified via Grad-CAM are detailed below. We summarize the results of the three binary classification tasks, including accuracy, precision, specificity, and sensitivity, in Table 3; each task is further discussed in the following subsections together with the extracted biomarkers. We performed five-fold cross-validation for each binary classification task, each fold with a balanced, non-overlapping test set and an oversampled training set, and report the mean and standard deviation of accuracy, precision, specificity, and sensitivity.

Table 3 Summary of results achieved from the models for three classification tasks.

Migraine classification

On the blind five-fold test sets, the model achieved an average accuracy of 89.96% (± 8.10%), with a precision of 85.71% (± 9.10%), sensitivity of 96.66% (± 6.60%), and specificity of 83.33% (± 10.54%) for the HC versus Migraine classification.

Key brain regions contributing to the model’s classification decisions were identified using Grad-CAM activation score rankings; these regions are visualized in Fig. 3. Significant regions included the postcentral cortex, postcentral white matter, supramarginal cortex, supramarginal white matter, superior temporal cortex, superior temporal white matter, inferior parietal white matter, inferior parietal cortex, superior parietal cortex, superior parietal white matter, precuneus cortex, precuneus white matter, banks of the superior temporal sulcus white matter, hippocampus, middle temporal cortex, cerebellar cortex, and lingual white matter.

Fig. 3
figure 3

Key brain regions identified by Grad-CAM activation score for HC vs. Migraine. [wm: white matter; ctx: cortex; post: posterior.]

Acute post-traumatic headache classification

For the classification task distinguishing APTH from HC, we achieved an accuracy of 88.13% (± 6.10%), with 88.33% (± 9.00%) precision, 89.99% (± 7.50%) sensitivity, and 86.53% (± 11.58%) specificity on the five blind test sets.

Key brain regions contributing to the model’s classification decisions were identified using Grad-CAM activation scores. Significant regions included the rostral middle frontal cortex, rostral middle frontal white matter, precentral cortex, postcentral cortex, supramarginal cortex, superior temporal cortex, superior temporal white matter, middle temporal cortex, middle temporal white matter, lateral occipital cortex, lingual cortex, inferior parietal cortex, pars opercularis cortex, superior frontal cortex, inferior temporal cortex, inferior temporal white matter, fusiform cortex, cerebellar cortex, and cerebellar white matter. These regions are visualized in Fig. 4.

Fig. 4
figure 4

Key brain regions identified by Grad-CAM activation score for HC vs. APTH. [wm: white matter; ctx: cortex; post: posterior.]

Persistent post-traumatic headache classification

The model achieved an accuracy of 83.13% (± 5.20%), with 80.80% (± 10.36%) precision, 76.20% (± 13.60%) specificity, and 93.26% (± 8.20%) sensitivity for the HC versus PPTH classification task.

Significant brain regions contributing to the model’s classification decisions, as identified by Grad-CAM activation scores, included the rostral middle frontal cortex, precentral cortex, precentral white matter, postcentral cortex, postcentral white matter, supramarginal white matter, supramarginal cortex, superior temporal cortex, inferior parietal cortex, superior parietal white matter, superior parietal cortex, lateral occipital cortex, lateral occipital white matter, precuneus cortex, middle temporal cortex, lingual cortex, and cerebellar cortex. These regions are visualized in Fig. 5.

Fig. 5
figure 5

Key brain regions identified by Grad-CAM activation score for HC vs. PPTH. [wm: white matter; ctx: cortex; post: posterior.]

Discussion

One of the primary challenges in medical imaging-based downstream tasks, e.g., biomarker extraction, is the scarcity of large, annotated imaging datasets. Most previous deep learning methods require task-specific labeled training data to extract useful image features, and a single-modality image model trained from scratch on a relatively small dataset may be limited in extracting robust disease-related image features. Multi-modal models trained on large-scale image-text pairs34 have been shown to learn a robust understanding of visual concepts guided by natural language descriptions, allowing them to discriminate between categories they have not explicitly seen before. This task-agnostic, large-scale training to extract semantically meaningful image representations allows the model to adapt to different downstream applications with fine-tuning. In this study, we used headache disorders as a case example for testing our methods. Prior studies have demonstrated differences in brain structure among those with migraine or PTH compared to HC, making our headache dataset useful for such testing. Because labeled headache datasets and clinically verified biomarkers are rare, we hypothesized that leveraging a pre-trained multi-modal model may be a viable solution for extracting robust headache biomarkers.

The effectiveness of BiomedCLIP stems from its large-scale training on image-text pairs tailored for biomedical applications. It captures a more complete picture of the data by integrating complementary information from multiple sources (here, text and images), loosely mimicking how humans perceive the world through multiple senses. With the rationale of localizing biomarkers from imaging data alone, we used only the image encoder of this foundation model for headache classification.

We identified brain regions, using Grad-CAM activation scores, that were important for differentiating the three headache types from HC. As expected, based on prior imaging studies of headache and headache-associated symptoms that localize to many brain regions, numerous brain regions contributed to the classification of each headache type. It is beyond the scope of this manuscript to discuss the potential role of every identified brain region in migraine or PTH pathophysiology or symptomatology. However, regions identified in our analyses have been previously implicated in migraine and/or PTH, supporting the validity of our findings. For example, the hippocampus has been highlighted as an area associated with migraine and other chronic pain, perhaps related to the stress associated with pain35,36. Atypical functional connectivity of the supramarginal gyrus has been identified among those with migraine, likely due to its role as a somatosensory association region. The postcentral gyrus, a key region of the pain matrix given its involvement in primary somatosensation, was identified as significant for migraine in prior research37. Several of the identified regions are part of the default mode network, a network previously identified as abnormal in migraine38. Among those with PTH, several regions that we found to differ from HC have previously been associated with PTH39. For the regions found significant for PPTH, we found similar regions in one of our prior analyses (Siddiquee et al.5), such as the lingual, supramarginal, middle temporal, lateral occipital, postcentral, and cerebellar cortex. Chong et al.40 showed cortical thickness differences in PPTH patients in the precuneus, precentral, supramarginal, and inferior and superior parietal regions, and our Grad-CAM activation scoring also found these regions to be significant for the classification prediction. Differentiating participants with headache from HC provides potentially important information about brain regions that might participate in generating each headache type or be impacted by recurrent headaches. There could eventually be clinical utility for brain MRI-based headache classification models as well. Theoretically, “normalization” of structure or function in brain regions identified as important for differentiating those with headache from HC could serve as a biomarker of headache preventive treatment response, perhaps especially useful when developing new headache therapies. Furthermore, differentiating patients with headaches from HC is an important step toward developing classification models that differentiate between headache types with overlapping features, such as migraine and PTH. Although we do not foresee a future in which brain MRI is required for all headache diagnoses, it could supplement diagnostic decision making when the diagnosis is difficult to determine from patient symptoms alone.

To the best of our knowledge, the only deep learning-based approach for headache classification reported in the literature is that of Siddiquee et al.5, which used the same dataset as ours. The regions identified as significant for migraine using the proposed method slightly overlap with those identified by Siddiquee et al., including the hippocampus and precentral white matter. Additionally, we found other regions that contribute to migraine, including the supramarginal cortex, superior temporal cortex, supramarginal white matter, and precuneus. We also found multiple significant regions for APTH that overlap with those identified by Siddiquee et al., including the cerebellar cortex, lateral occipital cortex, inferior parietal cortex, and lingual cortex, as well as some new regions, including the superior temporal, middle temporal, rostral middle frontal, postcentral, and fusiform cortices. In our analysis of PPTH, we identified overlapping significant regions consistent with our prior study, including the cerebellum, lateral occipital white matter, precuneus, postcentral cortex, and inferior parietal cortex, and discovered new regions contributing to PPTH classification, such as the precentral cortex, rostral middle frontal cortex, supramarginal cortex, and superior temporal cortex. The classification accuracy of the proposed method is substantially improved for migraine and APTH. For migraine, we achieved an average accuracy of 89.96% across five folds, compared to the 75% accuracy of our previous method; for APTH, the new method achieved 88.13% average accuracy compared to 75% previously. For PPTH, we achieved an accuracy of 83.13%, lower than the 91.70% reported in the prior method. However, it is crucial to highlight that our evaluation utilized a more rigorous five-fold cross-validation approach, whereas the prior work assessed only a single model on a limited dataset. Consequently, while our accuracy may be lower for PPTH classification, it is indicative of more robust and generalizable performance owing to the comprehensive evaluation methodology.

Although classification accuracies for migraine and APTH were higher with the current method, some of the significant brain regions identified differed from those in our previous study. This discrepancy could be attributed to the following: (i) our current method employs a model pretrained on PMC-15M, a dataset comprising 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, so the current model is likely more generalized than the one used in the previous study; and (ii) we adopted a different model architecture and training approach, specifically a ViT with pretraining, which may capture features more effectively than a traditional ResNet trained from scratch. While this study focused on binary classification tasks (e.g., HC vs. each headache type), future work will explore multi-class classification (e.g., Migraine vs. APTH vs. PPTH vs. HC) to better reflect real-world diagnostic needs. We also plan to evaluate cross-site generalization and model calibration in future clinical validation studies.

Limitations of this study include: (i) although we used a pretrained ViT model, the validation and testing sets are still small. (ii) MRIs from those with headache and HC were obtained using various scanners and acquisition parameters; while this variability might be viewed as a limitation, it can also be considered a strength, since heterogeneity in the dataset may lower classification accuracy but increases the likelihood that the classification results will generalize to new patient populations. (iii) The consistency observed in the brain regions that most contributed to classification in this study and previous studies may partly be attributed to participant overlap between the studies. (iv) Although participants in the IXI dataset are classified as healthy, they may not have been screened for conditions such as migraine, a history of mTBI, or PTH. (v) All participants with migraine and most participants with PTH were enrolled at the Mayo Clinic. (vi) Because this study relies solely on structural MRI, factors such as sex, handedness, medication use, aura presence, and white matter hyperintensities were not controlled for; this broad inclusion may have introduced variability and impacted the robustness of the model. (vii) We used six samples from each class in the test set to enable a one-to-one comparison with the only existing method in the literature, but this small sample size may lead to statistical instability in the reported accuracy metrics. Although the participant demographics, headache characteristics, and mTBI characteristics are consistent with expectations, the enrolled population may differ from the general population of individuals with migraine and PTH, which could limit the generalizability of study results. In the future, we will explore other experimental designs, such as a different data split with a larger test set and a smaller fine-tuning set.

Conclusion

This study, using headaches as a case example, demonstrates the potential of a pretrained CLIP-based model for disease classification and biomarker extraction, particularly in scenarios with limited training data. Compared to deep learning models reported in the literature (e.g., a ResNet-based model) trained from scratch, the CLIP-based approach showed improved classification performance. Notably, evaluating our model with a more robust five-fold cross-validation process ensures that the extracted biomarkers are derived from multiple models, enhancing their reliability compared to those generated in previous work. These findings highlight the promise of pretrained models in advancing clinical diagnostic tools and underscore the importance of rigorous evaluation methodologies.