Abstract
This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral-7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
Introduction
The rapid advancement of large language models (LLMs) has revolutionized their practical applications across various domains, including medicine. These sophisticated models, trained on vast datasets, excel in a wide array of natural language processing tasks and demonstrate remarkable adaptability in assimilating specialized information from diverse medical fields1. Although their training primarily centers on next-word prediction, LLMs have emerged as powerful, evidence-based knowledge assistants for healthcare providers, offering valuable insights and support in clinical decision-making processes2,3,4,5.
In medical and clinical practice, machine learning models, particularly classical machine learning (CML) models (i.e., feature-based algorithms that learn patterns from preprocessed data rather than raw inputs), have gained significant traction in predicting patient outcomes, prognoses, and mortality rates. These models typically employ supervised and unsupervised learning methods, which primarily utilize structured data6. However, clinical datasets often present a complex interplay of structured and unstructured information, with clinical notes serving as prime examples of the latter. Traditionally, patient information management via machine learning has followed a two-step approach: transforming unstructured textual data into a structured format, followed by training CML models on these structured datasets. This process, however, often leads to potential information loss and introduces complexities in model deployment, hindering practical application in clinical settings7.
While the efficacy of LLMs in handling unstructured text is well documented8, their performance on structured data and their comparative effectiveness against CML models remain a critical area of investigation. This is particularly relevant given that much of historical medical data is stored in structured formats that are often difficult to integrate9. Table 1 summarizes previous studies comparing the performance of LLM and CML approaches in medicine10,11,12,13,14. These studies reported varied results, primarily due to differences in the evaluated tasks (number of input features, sample size, and prediction complexity) and in the transformation techniques used (e.g., transforming tables into textual prompts). However, they focus on tasks with few features (< 12) that do not reflect the complexity of real-world medical decisions and provide few training instances (< 1,000), preventing CMLs from reaching their maximum performance.
Our study aims to address this knowledge gap by evaluating LLMs’ predictive capabilities in the context of COVID-19 mortality prediction via a high-dimensional dataset and simple table-to-text transformation. By utilizing a sufficient number of training instances, we provide the opportunity for CMLs to reach their maximum performance, enabling a more robust comparison with LLMs. This investigation is designed to provide insights into CML versus LLM comparisons in real-world, time-sensitive, and complex clinical tasks.
Methods
Ethical consideration
The study was approved by the Institutional Review Board (IRB) of Shahid Beheshti University of Medical Sciences (IR.SBMU.RIGLD.REC.004 and IR.SBMU.RIGLD.REC.1399.058). The IRB exempted this study from informed consent. Data were pseudonymized before analysis, and patients' confidentiality and data security were prioritized at all levels. The study was conducted under the Declaration of Helsinki (2013) guidelines, and all experiments were performed in accordance with Iran Ministry of Health regulations. Informed consent was collected from all individuals or their legal guardians. During the generation of LLM predictions via the OpenAI API and the Poe web interface, we opted out of data use for training on OpenAI and used models in Poe configured not to train on inputs, to maintain the safety of patient information.
Study aim and experimental summary
The objective of this research is to evaluate the efficacy of CMLs in comparison to LLMs, utilizing a dataset characterized by high-dimensional tabular data. We employed a previously compiled dataset and focused our experimental efforts on the task of classifying COVID-19 mortality. As illustrated in Fig. 1, the primary experiment encompasses the following:
Study design and experimental summary. This figure illustrates the design and workflow of the study, which compared large language models (LLMs) and classical machine learning (CML) approaches for predicting COVID-19 patient mortality. The patient data included demographics, symptoms, past medical history, and laboratory results. The data undergo preprocessing before being structured into input–output instances. The CML pipeline involves training and validation via various algorithms, such as logistic regression, support vector machine, and random forest, whereas the LLM pipeline involves a prompt engineering loop. We also fine-tuned one LLM, Mistral-7b, using the inputs and ground-truth labels. The study aims to predict patient outcomes (survival or death) on the basis of the provided information.
- Assessment of the performance of seven CML models on both internal and external test sets.
- Assessment of eight LLMs and two pretrained language models on the test set.
- Assessment of a trained LLM's performance on both internal and external test sets.
Additionally, we investigate the performance of models necessitating training (CML and trained LLM) across varying sample sizes, coupled with an elucidation of model prediction mechanisms through SHAP analysis.
Study context, data collection, and dataset
This study was conducted as part of the Tehran COVID-19 cohort, which included four tertiary centers with dedicated COVID-19 wards and ICUs in Tehran, Iran. The study period was from March 2020 to May 2023 and included two phases of data collection. The protocol and results of the first phase have been published previously. The four COVID-19 peaks during this period covered the Alpha, Beta, Delta, and Omicron variants.
All admitted patients with a positive swab test during the first two days of admission, or those with CT scans and clinical symptoms, were included in the study. A medical team collected the patients' symptoms, comorbidities, habitual history, vital signs at admission, and treatment protocol through the hospital information system (HIS) and by reviewing the medical records. Laboratory values from the first and second days of admission were collected and organized from the hospitals' electronic laboratory records using pandas (v1.5.3) and NumPy (v1.24.1). Patients with a negative PCR result in the first two days of admission or with a missing clinical record in the HIS were excluded.
The dataset included the records of 9,134 patients with COVID-19. The data were filtered to include demographic information, comorbidities, vital signs, and laboratory results collected at the time of admission (first two days).
Computational environment
All classical machine learning (CML) experiments were performed on a workstation equipped with an Intel Core i9-12900K CPU, 64 GB of RAM, and an NVIDIA RTX 3090 GPU (24 GB VRAM), running Ubuntu 22.04 and Python 3.10. The primary packages utilized include scikit-learn (version 1.2.2), XGBoost (version 1.7.5), pandas (version 1.5.3), and NumPy (version 1.24.1).
The fine-tuning of the Mistral-7b-Instruct model was conducted using an NVIDIA A100 80GB GPU via a cloud-based environment (Google Cloud Platform), utilizing the transformers (v4.37.2), peft (v0.9.0), and bitsandbytes (v0.41.1) libraries. The QLoRA fine-tuning procedure was implemented using 4-bit quantization, gradient accumulation steps, and mixed-precision training to optimize memory usage and reduce computational cost. All LLM zero-shot experiments were conducted via the OpenAI API and Poe interface under controlled sessions to ensure reproducibility.
Data preprocessing
Supplementary Figure S1 illustrates a summary of the pipeline, from raw data preprocessing through feature engineering and cleaning to the final training–test data split used for model development and evaluation.
Imputing and normalization
The features in the dataset were divided into categorical and numerical categories. To address missing values in the numerical features, we used the iterative imputer from the scikit-learn library, which iteratively predicts each feature with missing values, following the multiple imputation by chained equations (MICE) approach (16,17). Missing values in the categorical features were imputed via KNN imputation from the scikit-learn library. For optimal model performance, the dataset was normalized via a standard scaler (18). These preprocessing steps were executed independently for the input features of the training, test, and external validation sets, ensuring a consistent approach for handling missing values across the experimental sets without information leakage.
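A minimal sketch of these imputation and scaling steps is shown below, assuming each split (training, internal test, external validation) is a pandas DataFrame and that `num_cols` and `cat_cols` are lists of numerical and categorical column names; the function name, neighbor count, and rounding of imputed category codes are illustrative assumptions rather than the exact study configuration.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.preprocessing import StandardScaler


def preprocess_split(df, num_cols, cat_cols):
    """Impute and scale one experimental split independently, as described above."""
    out = df.copy()
    # MICE-style iterative imputation for numerical features
    out[num_cols] = IterativeImputer(random_state=0).fit_transform(out[num_cols])
    # KNN imputation for (numerically encoded) categorical features,
    # rounded back to the nearest valid category code
    out[cat_cols] = KNNImputer(n_neighbors=5).fit_transform(out[cat_cols]).round()
    # Standard scaling of the numerical features
    out[num_cols] = StandardScaler().fit_transform(out[num_cols])
    return out
```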
Feature selection
The dataset comprised 81 on-admission features. The dataset was separated into internal and external validation sets on the basis of the patients' hospitals: patients from Hospital-4 were used for external validation, whereas patients from the remaining hospitals were used for internal validation. For internal validation, we split the data into 80% for training and 20% for testing.
The output features in this study include "in-hospital mortality," "ICU admission," and "intubation"; we focused solely on in-hospital mortality as the target and excluded the other output features. Of the 81 features initially available, 76 were employed for training, comprising 53 categorical and 23 numerical features; two duplicate features were dropped during data wrangling.
We employed the Lasso method for feature selection because of its effectiveness in handling high-dimensional data. Lasso introduces regularization by adding a penalty term to the linear regression objective function, which encourages sparsity in the feature coefficients15,16. This approach proved superior to alternative methods and facilitated notable enhancements in our results. Through the application of Lasso, we derived a refined dataset that highlighted the most impactful features on the basis of their importance, aiding dimensionality reduction. We subsequently ranked the features and selected the top 40 for further analyses.
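A hedged sketch of this ranking step is given below; `X_train`, `y_train`, and the regularization strength `alpha` are illustrative assumptions rather than the study's exact settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Fit a Lasso regression on the (scaled) training features against the mortality label
lasso = Lasso(alpha=0.01, max_iter=10_000)
lasso.fit(X_train, y_train)

# Rank features by the absolute magnitude of their coefficients and keep the top 40
importance = pd.Series(np.abs(lasso.coef_), index=X_train.columns)
top_40 = importance.sort_values(ascending=False).head(40).index
X_train_selected = X_train[top_40]
```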
Oversampling
To address the issue of class imbalance in our dataset, we employed the synthetic minority oversampling technique (SMOTE), a widely used method in machine learning, particularly for medical diagnosis and prediction tasks17. By applying SMOTE, we mitigated dataset imbalances, resulting in a more robust and reliable analysis for predicting mortality. SMOTE works by creating synthetic samples for the minority class instead of simply duplicating existing samples. It selects samples from the minority class and their nearest neighbors and then generates new synthetic samples by interpolating between these samples and their neighbors. This approach not only increases the number of samples in the minority class but also introduces new data points, improving dataset diversity. In our experiments, the SMOTE technique was applied to the training set (X_train), increasing the number of samples from 6118 to 9760.
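The resampling step can be reproduced with the imbalanced-learn implementation of SMOTE, sketched below; the variable names and random seed are assumptions.

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples by interpolating between neighboring minority instances
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_selected, y_train)
# In this study, resampling grew the training set from 6,118 to 9,760 instances.
```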
Preparing data for the LLM
To prepare the data for input into the LLM, we completed all the previous steps for feature selection and sampling, but normalization was not performed. As shown in Fig. 1, we converted the dataset into text. We categorized the dataset features into symptoms, past medical history, age, sex, and laboratory data. For symptoms and medical history, we considered only positive data. For age, we added ‘the patient’s age is’ before the age number. For sex, we used ‘male’ and ‘female.’ We used the normal range of laboratory data to classify the data into the normal range, higher than the normal range, and lower than the normal range. For example, if blood pressure and oxygen saturation were higher than the normal range, we used the sentence ‘blood pressure and oxygen saturation are higher than the normal range.’ We considered only laboratory data that were higher or lower than the normal range. The exclusion of negative features in symptoms and past medical history, or the normal range in laboratory data, is due to limitations in LLM context windows. We then concatenated the dataset into a single paragraph for each patient, indicating their medical history.
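A simplified sketch of this table-to-text conversion is shown below; the column names, the sex encoding, and the `lab_ranges` lookup of normal ranges are hypothetical placeholders for the study's actual dictionary.

```python
def patient_to_text(row, symptom_cols, history_cols, lab_ranges):
    """Convert one patient's structured record into a single descriptive paragraph."""
    parts = [f"The patient's age is {int(row['age'])}.",
             "The patient is male." if row["sex"] == 1 else "The patient is female."]

    symptoms = [c for c in symptom_cols if row[c] == 1]        # positive symptoms only
    if symptoms:
        parts.append("Symptoms: " + ", ".join(symptoms) + ".")

    history = [c for c in history_cols if row[c] == 1]         # positive history only
    if history:
        parts.append("Past medical history: " + ", ".join(history) + ".")

    # Keep only laboratory values that fall outside the normal range
    high = [lab for lab, (lo, hi) in lab_ranges.items() if row[lab] > hi]
    low = [lab for lab, (lo, hi) in lab_ranges.items() if row[lab] < lo]
    if high:
        parts.append(", ".join(high) + " are higher than the normal range.")
    if low:
        parts.append(", ".join(low) + " are lower than the normal range.")

    return " ".join(parts)
```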
CML predictive performance
We employed seven CML algorithms: logistic regression (LR), support vector machine (SVM), decision tree (DT), k-nearest neighbor (KNN), random forest (RF), multilayer perceptron neural network (MLP), and XGBoost. The hyperparameters were optimized via grid search and cross-validation. The full details of training and hyperparameters are provided in Supplementary Section S1.
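A hedged sketch of the tuning loop for one model is shown below; the XGBoost grid is illustrative and does not reproduce the study's full search space (see Supplementary Section S1), and the resampled training arrays are assumed from the preceding steps.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="accuracy",   # accuracy served as the primary tuning criterion
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_res, y_train_res)
best_xgb = search.best_estimator_
```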
LLM predictive performance
We utilized open-source and proprietary LLMs to test their predictive power on clinical texts transformed from tabular data. First, we tested different prompts to determine the most efficient prompt, as well as the temperature (between 0.1 and 1). The full prompts are listed in Supplementary Table S1. We then sent the clinical text and commands, received the unstructured output, and extracted the selected outcome, which could be either "survive" or "die." We used a separate session for each prediction, preventing the LLM from retaining memory of previous generations.
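A sketch of one such prediction call is shown below, using the OpenAI Python client; the prompt wording and the label-extraction rule are illustrative assumptions (the exact prompts are listed in Supplementary Table S1).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def predict_outcome(patient_text, model="gpt-4", temperature=0.1):
    """Send one patient's text to the model and extract a 'survive'/'die' label."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{
            "role": "user",
            "content": (patient_text + "\nBased on this admission information, will the "
                        "patient survive or die in hospital? Answer with one word: survive or die."),
        }],
    )
    answer = response.choices[0].message.content.lower()
    if "survive" in answer:
        return "survive"
    if "die" in answer:
        return "die"
    return "undefined"  # undefined outputs were re-prompted up to five times in the study
```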
We tested the open-source, open-weight models Mistral-7b, Mixtral-8x7B, Llama3-8b, and Llama3-70b via the Poe chat interface. OpenAI models, including GPT-3.5T, GPT-4, GPT-4T, and GPT-4o, were utilized via the OpenAI API. We also tested the performance of two pretrained language models, BERT18 and ClinicalBERT19, the latter being a version of BERT fine-tuned on clinical text. A list of all LLMs, their dates of use, and their model parameters is available in Supplementary Table S2.
Zero-shot classification
Zero-shot classification is a prompt-engineering approach in which the prompt is given to the model without any task-specific training. It is a form of transfer learning in which a model trained for other purposes is employed directly rather than fine-tuning a new model, thereby reducing training costs. To perform zero-shot classification, we used eight different LLMs and two pretrained language models. We provided each patient's history as input to predict whether the patient would die or survive and then stored the results.
Fine-tuning LLM
We fine-tuned one of the open-source LLMs, Mistral-7b-Instruct-v0.2, a decoder-only, GPT-like large language model with 7 billion parameters that is trained on a mixture of publicly available and synthetic data and is suited to natural language processing (NLP) and text-generation tasks. Fine-tuning an LLM is usually considered time-consuming and expensive; however, several methods have recently been introduced to reduce these costs. We implemented the QLoRA fine-tuning approach to optimize the LLM while minimizing computational resources20.
The model was configured for 4-bit loading with double quantization, utilizing the "nf4" quantization type and the torch.bfloat16 compute data type. A 16-dimensional LoRA adapter was applied to the attention and targeted projection modules. We used the PEFT library to create a LoraConfig object with a dropout rate of 0.1 and the task type "CAUSAL_LM". The training pipeline, established via the transformers library, consisted of 4 epochs with a per-device batch size of 1 and 4 gradient accumulation steps. We utilized the "paged_adamw_32bit" optimizer with a learning rate of 2e-4 and a weight decay of 0.001. Mixed-precision training was conducted via fp16, with a maximum gradient norm of 0.3 and a warm-up ratio of 0.03. A cosine learning rate scheduler was employed, and training progress was logged every 25 steps and reported to TensorBoard. This methodology, which combines QLoRA with the bitsandbytes library, enables efficient enhancement of our language model while significantly reducing resource requirements, demonstrating superior performance across various instruction datasets and model scales. A more detailed description is provided in Supplementary Section S2.
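The configuration described above can be sketched with transformers, peft, and bitsandbytes as follows; the LoRA rank, alpha, and the exact target projection modules are assumptions where the text is not explicit, and the dataset and trainer wiring are omitted.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit loading with double quantization, nf4 type, and bfloat16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # assumed LoRA rank ("16" in the text)
    lora_alpha=16,                          # assumed
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="mistral7b-qlora-mortality",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,                              # mixed-precision training as described
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
    report_to="tensorboard",
)
# A Trainer (or SFT-style trainer) would then be constructed with `model`,
# `training_args`, and the text-converted training instances.
```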
CML and LLM performance on different sample sizes
To investigate the influence of training sample sizes on model performance, we conducted a series of experiments using varying sample sizes: 20, 100, 200, 400, 1000, and 6118. Multiple models were trained using these sample sizes, and their performance was evaluated on the basis of the F1 score and accuracy metrics via an internal test set. The objective of this exploration was to gain valuable insights into the correlation between the volume of training data and the accuracy of predictive models.
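The sample-size experiment reduces to a simple learning-curve loop, sketched below for a single model; the pre-shuffled training arrays and the use of XGBoost alone are illustrative assumptions (the study repeated this for multiple models).

```python
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

sizes = [20, 100, 200, 400, 1000, 6118]
results = {}
for n in sizes:
    model = XGBClassifier(eval_metric="logloss")
    model.fit(X_train_selected[:n], y_train[:n])   # subsample of the (pre-SMOTE) training set
    preds = model.predict(X_test)                  # fixed internal test set
    results[n] = {"accuracy": accuracy_score(y_test, preds),
                  "f1": f1_score(y_test, preds)}
```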
Evaluation and cross-validation
The accuracy of the outputs was assessed by comparing them against a ground truth that categorized outcomes as either mortality or survival. Outputs from the LLM were similarly classified. If an LLM initially produced an undefined result, the prompt was repeatedly presented up to five times to elicit a defined prediction; these instances are documented in Supplementary Table S2. We evaluated the models’ performance via five critical metrics: specificity, recall, accuracy, precision, and F1 score. To optimize our models, we employed a grid search strategy with accuracy as the primary criterion.
We further implemented 5-fold cross-validation on the training dataset (n = 6,118). The training data were randomly partitioned into five equal-sized subsets. For each fold, four subsets were used for training while the remaining subset served as a validation set. This process was repeated five times, with each subset serving as the validation set once. We calculated performance metrics (accuracy, precision, recall, specificity, F1 score, and AUC) for each fold and reported the mean and standard deviation across all five folds.
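A minimal sketch of this 5-fold cross-validation is shown below; the model shown and the specificity scorer (recall of the negative class) are assumptions consistent with the metrics listed above.

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "specificity": make_scorer(recall_score, pos_label=0),  # recall of the negative class
    "f1": "f1",
    "auc": "roc_auc",
}
cv = cross_validate(XGBClassifier(eval_metric="logloss"),
                    X_train_selected, y_train,   # training set (n = 6,118)
                    cv=5, scoring=scoring)
for name in scoring:
    scores = cv[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```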
Statistical analysis
Baseline characteristics were compared between patients who died and those who survived using appropriate statistical tests based on variable type and distribution. Continuous variables were analyzed using the Mann-Whitney U test (chosen over parametric alternatives due to non-normal distributions typical of clinical data) and presented as mean ± standard deviation. Categorical variables were compared using Pearson’s chi-square test. The area under the receiver operating characteristic curve (AUC) was used to illustrate the predictive capacity of each model. All statistical tests were two-sided with significance set at P < 0.05. Statistical analyses were performed using Python 3.12 with SciPy (v1.16.2).
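For reference, the group comparisons map onto SciPy as in the brief sketch below, where `died` and `survived` are arrays of one continuous variable and `table` is a contingency table of one categorical variable by outcome (both hypothetical names).

```python
from scipy.stats import chi2_contingency, mannwhitneyu

# Continuous variable: Mann-Whitney U test between deceased and surviving patients
u_stat, p_continuous = mannwhitneyu(died, survived, alternative="two-sided")

# Categorical variable: Pearson's chi-square test on the outcome contingency table
chi2, p_categorical, dof, expected = chi2_contingency(table)
```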
Explainability
In our study, we employed SHAP (SHapley Additive exPlanations) values to examine both the total (global) and individual (granular) impacts of features on model predictions. The numerical data were normalized via a standard scaler, and we adopted a model-agnostic methodology in which XGBoost served as the explainer model for the LLM predictions; XGBoost was chosen for its robust performance, as demonstrated in prior research and our own findings. SHAP values provide a clear, quantitative assessment of how each feature influences individual predictions, enhancing transparency in the model's decision-making process.
For our analysis, we used the test set for each model, generated SHAP values for every prediction, and computed the mean and standard deviation of the absolute SHAP scores. We then converted SHAP scores from a range of 0 to 1 into “global impact percentages” by dividing each feature’s score by the total score of all features and multiplying by 100. We calculated the average impact percentages for both CMLs and LLMs by first averaging the SHAP scores and then determining the impact percentages. To compute the standard deviation of the impact percentages, we adjusted the average standard deviation of CML/LLM via a multiplication factor derived from the ratio of the impact score to the SHAP mean. The global impact percentage represents the proportion of each feature’s impact on the predicted class across the entire dataset. A violin plot visually represents the variability of each input feature’s effect on the output.
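The global impact percentage computation can be sketched as follows, assuming a fitted XGBoost model (`best_xgb`) and a test-set DataFrame `X_test`; the averaging across models and the standard-deviation adjustment described above are omitted.

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_test)            # shape: (n_samples, n_features)

mean_abs = np.abs(shap_values).mean(axis=0)            # mean |SHAP| per feature
global_impact_pct = 100 * mean_abs / mean_abs.sum()    # each feature's share of the total impact

top10 = sorted(zip(X_test.columns, global_impact_pct), key=lambda x: -x[1])[:10]
for name, pct in top10:
    print(f"{name}: {pct:.2f}%")
```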
Results
Our study initially included a dataset of 9,057 patients, with a mean age of 58.40 ± 19.81 years and a male‒female ratio of 1.19. The overall mortality rate in this group was 25.11% (N = 1818). Table 2 shows the distribution of variables and missing data for both the survived and mortality cohorts. We utilized an internal validation test set and an external validation set comprising 2,470 and 2,248 participants, respectively, each with a mortality rate of 50%. Additionally, the validation set for zero-shot classification included 590 patients randomly selected from the internal validation test set, with a mean age of 63.85 ± 18.37 years, a male-to-female ratio of 355:255, and a mortality rate of 50% (mortality count = 295). Table 3 details the performance metrics of all models across internal, external, and cross-validation.
Classic machine learning predictive performance
As shown in Fig. 2, XGBoost and RF were the top-performing models in terms of accuracy, achieving scores of 86.28% and 86.52%, respectively. These models also excelled in precision, recall, specificity, and F1 scores, all surpassing 85%. The MLP also delivered an acceptable performance, with an accuracy of 75.87%. When the models were applied to the external validation set, a slight decline in the AUC of 2–5% was observed. Supplementary Figures S2 and S3 depict the confusion matrix of the CMLs on the internal validation test set and the external validation set, respectively. SVM, KNN, and DT showed consistent performance across both validation sets, confirming their reliability in generalizing to unseen data.
ROC curves and AUC scores for COVID-19 mortality prediction models in internal and external validation. Models include the fine-tuned LLM (Mistral-7b) and seven classical machine learning algorithms: logistic regression (LR), support vector machine (SVM), decision tree, k-nearest neighbors (KNN), random forest, neural network, and XGBoost. Upper panel: internal validation; lower panel: external validation. Curves show the true positive rate (TPR) versus the false positive rate (FPR). AUC scores indicate each model's discriminative power.
LLM: zero-shot classification and fine-tuned Mistral-7b
The zero-shot classification results showed variability among the models, with GPT-4 outperforming the others by achieving an accuracy of 0.62 and an F1 score of 0.43 and recording the highest recall among the LLMs at 0.28. In general, the LLMs exhibited low recall rates, predominantly classifying predictions as "mortality." The open-source models, including Llama-3-70B, Llama-3-8B, Mistral-7b-Instruct, and Mixtral-8x7B-Instruct-v0.1, had F1 scores ranging from 0.03 (Mistral-7b) to 0.15 (Llama3-8B and Llama3-70b). Notably, GPT-4o showed limited effectiveness, with an F1 score of 0.01, indicating a challenge in distinguishing between true positives and true negatives. The pretrained language models, BERT and ClinicalBERT, also labeled all outcomes as "die," failing to provide predictive power. Supplementary Figure S4 shows the confusion matrices of the LLMs and language models.
Fine-tuning Mistral-7b significantly improved its performance, increasing the F1 score from 0.03 to 0.74 on the internal test set and to 0.69 on the external test set. The fine-tuned version also demonstrated a high recall of 78.98%, a substantial increase from 1% in zero-shot classification, showing its ability to accurately identify a greater proportion of actual survival instances. This consistency between internal and external validation highlights the generalizability of the fine-tuned Mistral-7b for mortality prediction. The confusion matrices of the fine-tuned and zero-shot Mistral-7b are presented in Supplementary Figure S5.
Comparing models on different training sample sizes
To evaluate the impact of training sample size on model efficacy, experiments were conducted across various sample sizes. Figure 3 shows that the performance of all CMLs increased as the size of the training set increased. XGBoost demonstrated the strongest performance across all categories: small (100 samples), medium (400–1000 samples), and full training set sizes (6118 samples). Notably, the MLP neural network and SVM exhibited the most significant performance improvements, with accuracies increasing from 55% with 20 training samples to 73% and 77%, respectively.
Performance comparison of classical machine learning (CML) models and fine-tuned large language models (LLMs) in COVID-19 mortality prediction across varying sample sizes. F1 scores and accuracy are shown for seven CML models (logistic regression, support vector machines, decision trees, k-nearest neighbors, random forests, XGBoost, and neural networks) and a fine-tuned LLM. Training sample sizes range from 20 to 2047. XGBoost consistently outperforms other models, with performance improving as sample size increases.
In contrast, while the zero-shot performance of GPT-4 reached an F1 score of 0.43, CMLs still surpassed both zero-shot classification and fine-tuned LLMs in predicting COVID-19 mortality. During the fine-tuning of Mistral-7b, notable performance degradation occurred in scenarios with small training sizes, leading to a loss of broader model understanding, an effect termed “negative transfer.”
Explainability: impact of features on prediction
As shown in Supplementary Figure S5, while the global impact of features among the CMLs exhibits similar patterns, with many of the top 10 impactful features being consistent, the granular impact differs significantly. For example, in the context of patients' O2 saturation levels, XGBoost, RF, DT, and MLP consider both high (increasing mortality risk) and low (increasing survival chance) levels to be significant, whereas KNN and LR focus only on low saturation levels. According to Fig. 4.a, the most influential features are age (11.18%) and O2 saturation (9.89%), followed by LOC (4.83%), lymphocyte count (4.79%), dyspnea (3.76%), and sex (3.68%).
SHAP analysis comparing feature importance in COVID-19 mortality prediction models. (a) Average global feature impact for classical machine learning (CML) models. (b) Global impact scores for XGBoost (best-performing CML). (c) SHAP score distribution for XGBoost. (d) Average global feature impact for large language models (LLMs). (e) Global impact scores for GPT-4 (best-performing LLM). (f) SHAP score distribution for GPT-4. Key features include age, O2 saturation (VS - O2 Sat), and creatinine (Cr) levels. In panels c and f, red indicates higher feature values; positive SHAP values increase mortality prediction, negative values decrease it. VS: vital sign.
Conversely, the influence of features in the LLMs, particularly in lower-performing models such as Mistral-7b and GPT-4o, appears less coherent, as illustrated in Supplementary Figures S6 and S7. This inconsistency contributes to noise in the average feature impact among the LLMs (Fig. 4.d). Nonetheless, age (6.58%) and O2 saturation (5.51%) remained the most significant features, with a series of laboratory tests, including neutrophil count, PT, ALP, MCV, K, Na, ESR, and Cr, showing impacts in the 4–5% range.
When comparing the top performers among the CMLs and LLMs (XGBoost and GPT-4), the patterns of global (Fig. 4.b and Fig. 4.e) and granular (Fig. 4.c and Fig. 4.f) impacts diverge, with XGBoost displaying more specific impacts and GPT-4 showing broader ranges of impact.
Figure 5 illustrates how fine-tuning Mistral-7b altered the impact of features at both the global and granular levels. This refinement in prediction logic aligned the top 10 most important features more closely with those of CMLs, resulting in more equitable impact percentages among features and enhanced granularity.
SHAP analysis for Mistral-7b and fine-tuned Mistral-7b models in COVID-19 mortality prediction. (a) Global feature impact for base Mistral-7b, with alkaline phosphatase (ALP), prothrombin time (PT), and sodium (Na) as top features. (b) SHAP score distribution for base Mistral-7b, showing individual feature value influences. (c) Global feature impact for fine-tuned Mistral-7b, highlighting age and creatinine (Cr) as most influential. (d) SHAP score distribution for fine-tuned Mistral-7b, emphasizing the impact of age, Cr, and lymphocyte count. In panels b and d, higher SHAP values indicate increased mortality prediction.
Pipeline validation
Supplementary Table S3 presents the XGBoost model’s F1 scores for external validation, showing a result of 0.82 (AUC: 0.92) with imputation and 0.89 (AUC: 0.60) without imputation. Supplementary Table S4 presents data on external and internal validation using SMOTE to address class imbalance in CMLs. Application of SMOTE resulted in increased performance metrics for both validation sets across CMLs; for example, the XGBoost AUC rose from 0.60 to 0.92 in external validation.
Discussion
Our study reveals a notable performance gap between CML models and LLMs in predicting patient mortality via tabular data. RF and XGBoost emerged as the top CML performers, achieving over 80% accuracy and an F1 score of 0.86. In contrast, the best-performing LLM, GPT-4, achieved 62% accuracy and an F1 score of 0.43 in zero-shot classification. This disparity highlights the challenges LLMs face when dealing with purely tabular data. Notably, increasing our sample size from 5,000 patients in our previous study to 9,000 patients in this study significantly improved the performance of CML models. The AUC of RF improved from 0.82 to 0.94, underscoring the importance of large and diverse datasets in realizing the full potential of CMLs in medical tasks.
LLM performance heavily relies on the knowledge embedded within the model weights, the complexity of the input data, and the table-to-text transformation technique. Our approach, which uses a simple prompt and transformation to resonate with current clinical use, achieved results comparable to those of similar studies, with F1 scores of 0.50–0.60 across different medical tasks using LLMs such as GPT-4 or GPT-3.510,11,21. However, in line with many previous studies, we found that CMLs can surpass this zero-shot performance with even fewer than 100 training samples10,13.
Given the performance gap between CMLs and LLMs, researchers have explored two main approaches for improving LLM performance: pipeline improvements and fine-tuning. Previous studies have shown that LLMs can close the performance gap with CMLs via pipeline improvements such as prompt engineering techniques (XAI4LLM), few-shot approaches (XAI4LLM, EHR-CoAgent, TabLLM), multiple LLM runs to double-check results (EHR-CoAgent), the addition of a tree-based explainer alongside the LLM (XAI4LLM), or novel LLM-based table-to-text transformations (MediTab, TabLLM)21,22,23. However, many of their evaluated tasks may not resonate with real-world use, as they rely on low-dimensional datasets (8–15 features) that do not reflect complex real-world medical data and on limited sample sizes (< 500 instances in rare classes) that restrict CMLs from reaching their maximum performance.
The alternative approach, fine-tuning or in-context learning, aims to adapt the model to a new task and has previously been evaluated on named entity recognition and text extraction24,25. We validated this approach in our high-dimensional task, where fine-tuning Mistral increased the F1 score from 0.03 to 0.69, even with the resource-efficient QLoRA method. Our SHAP analysis provides initial evidence of an improved rationale after fine-tuning, as the top 10 features more closely align with XGBoost and clinician decision-making.
Despite these advancements, LLMs still face significant limitations that affect their applicability in medical settings. Their vulnerability to hallucination raises concerns about producing harmful information26, whereas computational constraints impose token limits that can truncate responses and diminish interaction quality27,28. Data privacy is another crucial concern, particularly in medical contexts, as many powerful LLMs are proprietary or require cloud-based computation, increasing the risk of data leaks28. Moreover, the cost of using LLM APIs for large clinical databases can disproportionately impact low- and middle-income communities29. While open-source models present a more affordable alternative, they may not match the capabilities of proprietary models.
In light of these challenges, alternative approaches have emerged, including the use of small pretrained language models and rule-based systems. These offer resource-efficient alternatives to large LLMs. Previous studies have shown that rule-based and gradient boosting algorithms can achieve strong overall performance in specific tasks, such as extracting physical rehabilitation exercise information from clinical notes30,31. Additionally, fine-tuning pretrained BERT-like models has yielded promising results in some medical applications. However, our brief experiment with the zero-shot performance of pretrained models (BERT and ClinicalBERT) revealed their limitations, suggesting that further research is needed to optimize these approaches for complex medical tasks.
It is important to acknowledge several limitations of our study. Although the fine-tuning method used was resource efficient, it may not have been the most effective for achieving maximum performance. Fine-tuning for conversational responses instead of classification tasks with models similar to BERT may result in less reliable predictions; however, this approach mirrors how clinicians interact with AI tools. Future research could investigate ways to balance conversational accessibility and prediction accuracy. Our fine-tuned model was a small LLM with the lowest performance among our eight tested LLMs, indicating that fine-tuning larger and more accurate models could yield better results. Furthermore, our table-to-text transformation and prompts were designed to resonate with a medical user context, but more robust approaches (e.g., few-shot learning, advanced prompt engineering, and sophisticated transformation techniques) may achieve higher accuracies, especially in zero-shot classification11,13. Although our sample size was substantial, the retrospective nature of our investigation necessitates prospective validation to confirm the generalizability of these findings. As all participating hospitals operated within our specific resource context, variations in healthcare access and quality may have influenced the generalizability of the models to other countries and settings.
Our findings highlight several critical areas for future research in the application of LLMs to medical data analysis. We propose the following research questions to advance the field:
- Does the LLM explanation of the prediction (death or survival) in human language align with the feature importance analysis? Can LLMs accurately explain their rationale?
- What would be the performance of fine-tuning pretrained models and large LLMs compared with that of small LLMs?
- Could we create a model to distinguish correct answers from incorrect answers via the LLM output? How can we measure the certainty of a given answer?
Conclusion
The efficacy of LLMs versus CML approaches in medical tasks appears to be contingent upon data dimensionality and data availability. In low-dimensional scenarios with limited samples, LLM-based methodologies may offer superior performance; however, as dimensionality increases and diverse samples become available, CML techniques tend to outperform the zero-shot capabilities of LLMs. Notably, fine-tuning LLMs can substantially enhance their pattern recognition and logical processing, potentially achieving performance levels comparable to those of CMLs. The ability of LLMs to process both structured and unstructured data may outweigh their marginally lower performance metrics relative to CMLs. Ultimately, the choice between LLMs and CMLs should be guided by careful consideration of task complexity, data characteristics, and clinical context demands, with further research warranted to elucidate the precise conditions under which each methodology excels.
Data availability
The code and information for generating the output are available at https://github.com/mohammad-gh009/Large-Language-Models-vs-Classical-Machine-Learning and https://github.com/Sdamirsa/Tehran_COVID_Cohort. The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request (sdamirsa@ymail.com). We welcome researchers to build upon our evaluation of LLMs on structured tabular datasets.
Abbreviations
- LLM: Large language model
- CML: Classical machine learning model
- LR: Logistic regression
- SVM: Support vector machine
- DT: Decision tree
- KNN: K-nearest neighbor
- RF: Random forest
- XGBoost: Extreme gradient boosting
- MLP: Multilayer perceptron
- ZSC: Zero-shot classification
- LASSO: Least absolute shrinkage and selection operator
- SMOTE: Synthetic minority oversampling technique
- QLoRA: Quantized low-rank adaptation
- MICE: Multiple imputation by chained equations
- ReLU: Rectified linear unit
- KBit: Knowledge bit
- CRP: C-reactive protein
- LDH: Lactate dehydrogenase
- NLP: Natural language processing
- CoT: Chain-of-thought
References
Karabacak, M. & Margetis, K. Embracing large Language models for medical applications: opportunities and challenges. Cureus https://doi.org/10.7759/cureus.39305 (2023).
Dathathri, S. et al. Plug and play language models: a simple approach to controlled text generation. in 8th International Conference on Learning Representations, ICLR 2020 (2020).
Han, J. M. et al. Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448 (2021).
Petroni, F. et al. Language models as knowledge bases? arXiv preprint arXiv:1909.01066 (2019).
Vaid, A. et al. Generative Large Language Models are autonomous practitioners of evidence-based medicine. arXiv preprint arXiv:2401.02851 (2024).
Zhang, D., Yin, C., Zeng, J., Yuan, X. & Zhang, P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med. Inf. Decis. Mak. 20, 1–11 (2020).
Sedlakova, J. et al. Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digit. Health. 2, e0000347 (2023).
Zhou, H. et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112 (2023).
Wornow, M. et al. The shaky foundations of large Language models and foundation models for electronic health records. NPJ Digit. Med. 6, 135 (2023).
Hegselmann, S. et al. Tabllm: Few-shot classification of tabular data with large language models. in International Conference on Artificial Intelligence and Statistics 5549–5581 (2023).
Wang, Z., Gao, C., Xiao, C. & Sun, J. MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement. arXiv preprint arXiv:2305.12081 (2023).
Cui, H. et al. LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction. arXiv preprint arXiv:2403.15464 (2024).
Nazary, F., Deldjoo, Y., Di Noia, T. & di Sciascio, E. XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare. arXiv preprint arXiv:2405.06270 (2024).
Patel, D. et al. Comparative Analysis of a Large Language Model and Machine Learning Method for Prediction of Hospitalization from Nurse Triage Notes: Implications for Machine Learning-based Resource Management. medRxiv https://doi.org/10.1101/2023.08.07.23293699 (2023).
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Simpson, L., Combettes, P. L. & Müller, C. L. c-lasso: a Python package for constrained sparse and robust regression and classification. arXiv preprint arXiv:2011.00898 (2020).
Liu, X. Y., Wu, J. & Zhou, Z. H. Exploratory undersampling for Class-Imbalance learning. IEEE Trans. Syst. Man. Cybernetics Part. B (Cybernetics). 39, 539–550 (2009).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in North American Chapter of the Association for Computational Linguistics (2019).
Wang, G. et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nat. Med. 29, 2633–2642 (2023).
Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. QLoRA: efficient finetuning of quantized LLMs. Adv. Neural Inf. Process. Syst. 36, 10088–10115 (2024).
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large Language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 35, 22199–22213 (2022).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Han, Z., Gao, C., Liu, J. & Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608 (2024).
Kanzawa, J., Yasaka, K., Fujita, N., Fujiwara, S. & Abe, O. Automated classification of brain MRI reports using fine-tuned large Language models. Neuroradiology https://doi.org/10.1007/s00234-024-03427-7 (2024).
Akbasli, I. T., Birbilen, A. Z. & Teksam, O. Human-Like Named Entity Recognition with Large Language Models in Unstructured Text-based Electronic Healthcare Records: An Evaluation Study. (2024).
O’Neill, M. & Connor, M. Amplifying Limitations, Harms and Risks of Large Language Models. arXiv preprint arXiv:2307.04821 (2023).
Savage, T. et al. Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. medRxiv https://doi.org/10.1101/2024.06.06.24308399 (2024).
Wirth, F. N., Meurers, T., Johns, M. & Prasser, F. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med. Inf. Decis. Mak. 21, 242 (2021).
Gangavarapu, A. Introducing L2M3, a multilingual medical large language model to advance health equity in low-resource regions. arXiv preprint arXiv:2404.08705 (2024).
Sivarajkumar, S. et al. Mining clinical notes for physical rehabilitation exercise information: natural Language processing algorithm development and validation study. JMIR Med. Inf. 12, e52289 (2024).
Chen, S. et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J. Am. Med. Inf. Assoc. 31, 940–948 (2024).
Acknowledgements
We used ChatGPT with the following prompt: “Is this paragraph grammatically correct, and can you make it sound scientific? Improve the grammar to improve the English style, understanding, and coherence”. Two authors, SAASN and MG, reviewed the suggestions and accepted relevant changes. All the authors are responsible for the validity of the final draft.
Author information
Authors and Affiliations
Contributions
MoGE: Conceptualization, Methodology, Programming, Investigation, Writing Original Draft; MaGE: Investigation, Methodology; ASN: Investigation, Methodology; HT: Writing Original Draft, Methodology; ZA: Investigation, Programming; AMK: Investigation; MS: Investigation; ZT: Investigation; FS: Investigation; MHB: Investigation; AF: Investigation; MT: Investigation; NG: Investigation; FH: Investigation; HA: Investigation; AA: Investigation; FA: Investigation; AS: Investigation; NA: Investigation; MAK: Investigation, Project Administration; HS: Investigation; AM: Investigation; SHZ: Investigation; OY: Investigation; RE: Investigation; MM: Investigation; DSN: Investigation; ALS: Investigation; HM: Investigation; SS: Investigation; ARS: Investigation; NG: Investigation; ET: Investigation, Validation; HH: Investigation; JSS: Reviewing and Editing the Manuscript; TS: Reviewing and Editing the Manuscript; AKS: Reviewing and Editing the Manuscript; ALS: Methodology, Validation, Reviewing and Editing the Manuscript; GN: Reviewing and Editing the Manuscript; IAD: Data Acquisition, Reviewing and Editing the Manuscript, Administration, Supervision; MAP: Data Acquisition, Reviewing and Editing the Manuscript, Administration, Supervision, Validation; SAASN: Conceptualization, Methodology, Programming, Data Curation, Writing and Editing the Original Draft, Project Administration, Supervision.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ghaffarzadeh-Esfahani, M., Ghaffarzadeh-Esfahani, M., Salahi-Niri, A. et al. Large language models versus classical machine learning performance in COVID-19 mortality prediction using high-dimensional tabular data. Sci Rep 15, 42712 (2025). https://doi.org/10.1038/s41598-025-26705-7