RETRACTED ARTICLE: Optimizing credit card fraud detection with random forests and SMOTE

Sundaravadivel, P.; Isaac, R. Augustian; Elangovan, D.; KrishnaRaj, D.; Rahul, V. V. Lokesh; Raja, R.

doi:10.1038/s41598-025-00873-y

Download PDF

Article
Open access
Published: 22 May 2025

RETRACTED ARTICLE: Optimizing credit card fraud detection with random forests and SMOTE

P. Sundaravadivel¹,
R. Augustian Isaac¹,
D. Elangovan¹,
D. KrishnaRaj¹,
V. V. Lokesh Rahul¹ &
…
R. Raja¹

Scientific Reports volume 15, Article number: 17851 (2025) Cite this article

10k Accesses
13 Citations
Metrics details

Subjects

This article was retracted on 22 December 2025

This article has been updated

Abstract

In the financial world, Credit card fraud is a budding apprehension in the banking sector, necessitating the development of efficient detection methods to minimize financial losses. The usage of credit cards is experiencing a steady increase, thereby leading to a rise in the default rate that banks encounter. Although there has been much research investigating the efficacy of conventional Machine Learning (ML) models, there has been relatively less emphasis on Deep Learning (DL) techniques. In this article, a machine learning-based system to detect fraudulent transactions using a publicly available dataset of credit card transactions. The dataset, highly imbalanced with fraudulent transactions representing less than 0.2% of the total, was processed using techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. To predict credit card default, this study evaluates the efficacy of a DL (Deep Learning) model and compares it to other ML models, such as Decision Tree (DT) and Adaboost. The objective of this research is to identify the specific DL parameters that contribute to the observed enhancements in the accuracy of credit card default prediction. This research makes use of the UCI ML repository to access the credit card defaulted customer dataset. Subsequently, various techniques are employed to pre-process the unprocessed data and visually present the outcomes through the use of exploratory data analysis (EDA). Furthermore, the algorithms are hyper tuned to evaluate the enhancement in prediction. We used standard evaluation metrics to evaluate all the models. The evaluation indicates that the Adaboost and DT exhibit the highest accuracy rate of 82 % in predicting credit card default, surpassing the accuracy of the ANN model, which is 78 %. Several classification algorithms, comprising Logistic Regression, Random Forest, and Neural Networks, were evaluated to determine their effectiveness in identifying fraudulent activities. The Random Forest model emerged as the best performing algorithm with an accuracy of 99.5% and a high recall score, indicating its robustness in detecting fraudulent transactions. This system can be deployed in real-time financial systems to enhance fraud prevention mechanisms and ensure secure financial transactions.

Enhancing credit card fraud detection using DBSCAN-augmented disjunctive voting ensemble

Article Open access 13 November 2025

Intelligent approach to detecting online fraudulent trading with solution for imbalanced data in fintech forensics

Article Open access 23 May 2025

Financial fraud detection through the application of machine learning techniques: a literature review

Article Open access 03 September 2024

Introduction

Due to the fraudulent tractions in the financial sector, there has been numerous challenges in identifying Credit card fraud who have been involving millions of dollars in each year¹. It’s not a wonder thing as all transactions in the online, the risk factors of intruding the fraudulent action on credit card which plays a major key in financial sector, thus leads to build a robust detection system to ensure the security in end to end points². In the traditional methods of identifying the fraud transactions³, which lacks and fails in adopting the upcoming fraudulent pattern where all are rule based systems and sometimes manual reviews. The forceful need of solution for the problem Machine learning (ML) offers a comprehensive way to cohere⁴. The ML gives a method of automatically learning patterns form past tractions data and find out the odd transactions which used to happen in real time. Nevertheless, detecting credit card fraud poses significant challenges for machine learning systems, principally due to the demanding nature of the dataset. Fraudulent transactions typically epitomize a tiny fraction of all transactions, making it difficult for models to correctly identify these rare events^5,6. Along with conventional supervised learning methods, some of the unsupervised and heuristic methods have also been studied. For instance, K-Means clustering has been suggested for identifying outlier transaction patterns in an unsupervised fashion⁷. Also, Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) have proven successful in optimizing fraud detection models and improving classification accuracy^8,9.In real-time systems and big data scenarios, scalable classification systems are critical. Specialized solutions for real-time fraud detection have demonstrated real-world value in early warning systems^10,11. Other novel solutions have been distance-based classification methodologies¹² and boundary reconstruction methods, which enhance the ability of classifiers to detect rare instances^13,14.

This article illustrates to address these challenges by adopting multi machine learning algorithms to detect fraudulent transaction in a highly imbalanced dataset. Synthetic Minority Over-Sampling Technique (SMOTE) technique is used for pre-process the data to balance the dataset⁵, and this articles focuses on evaluate various classifications models, like Logistic Regression, Random Forest and Neural Networks^15,16. The aim of this article is to build a robust and scalable and reliable fraud detection model that can accurately identified fraudulent transactions that minimise the false positives. The mainly used and focussed algorithm is Random Forest classifier emerged as the most effective model, which is achieving high level of accuracy and recall^4,15. This study dedicates to the development of real time fraud activities and reduce financial losses^2,17.

Ease of use

The main focus of this article is to develop a credit card fraud detection system which can be easily deployed and integrated into existing financial infrastructures. With a perception of developing a system, the machine learning models are used which result both simplicity and effectiveness, and real time transaction². The key objectives are:

1.
The data pre-processing pipeline has been automated to handle missing values, normalize features, and address class imbalance through Synthetic Minority Oversampling Technique (SMOTE)^5,6. This involves minimal manual intervention on model training, and allowing easy updates as new data whenever available.
2.
Modular code structure: The project follows a modular structure where each component—data pre-processing, model training, and evaluation—is implemented in separate functions or scripts. This modularity makes it easy to modify, update, or replace individual components without affecting the entire system¹⁰. New models or pre-processing techniques can be added seamlessly.
3.
Scalability:The Random Forest model, selected for its balance of performance and interpretability, is computationally efficient and scales well with large datasets¹⁵ ¹⁸. Additionally, the pre-processing techniques used ensure that the system remains responsive even as the volume of transactions grows.
4.
Model deployment: The trained model and scaler have been saved as joblib files, making it straightforward to deploy the model in a Flask-based web application. This web-based interface allows users to upload a CSV file with transaction data and receive real-time predictions, improving accessibility for both technical and non-technical users².
5.
User-friendly interface: For end-users, the web interface allows for intuitive interaction. Users can easily upload their transaction data for fraud detection and instantly receive results, with clear indications of whether the transactions are likely fraudulent or genuine¹⁹.

Methodology

The methodology for the Credit Card Fraud Detection System is structured into the following key stages: Figure 1. This figure appears to represent a flowchart for a Fraud Detection System.\

Data collection and pre-processing

Dataset Source: A synthetic dataset replicating credit card transactions was utilized to ensure class imbalance, which mirrors real-world fraud detection scenarios.

Data Cleaning: Removed redundant quotes and white spaces from column names to standardize data.

Feature separation: The dataset was divided into features (X) and the target variable (Y), where Y represented fraud (1) and legitimate (0) transactions.

Class Distribution Analysis: The distribution of the target variable was analysed to understand class imbalance and ensure appropriate sampling strategies.

Train-test split

The dataset was split into 80% training and 20% testing subsets to evaluate the model’s performance on unseen data.

Feature engineering: We performed correlation analysis to identify and remove highly correlated features, reducing dimensionality and improving model interpretability. Principal Component Analysis (PCA) was applied to further reduce features while preserving variance.

Model selection and training

Various machine learning models were selected for evaluation, including:

o
Logistic regression: A baseline model for comparison.
o
Decision trees: For their interpretability and simplicity.
o
Random forest: An ensemble model to improve prediction accuracy and reduce overfitting.
o
Neural networks: For capturing complex patterns in the data.

Each model was trained using 80% of the dataset, with 20% reserved for testing. Hyper parameter tuning was conducted using grid search and cross validation to optimize model performance. Figure 2 and 3. This figure represents a decision flowchart for analyzing transactions in a fraud detection system.

Evaluation metrics: The models were evaluated based on metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to assess their effectiveness in detecting fraudulent transactions.
Logistic regression: A logistic regression model was selected for its interpretability and effectiveness in binary classification.
Class weights were computed dynamically to address the imbalance in class distributions using balanced weight computation. The model was trained using the training data, with max iteration set to 1000 for convergence.
Cross-validation: Performed fivefold cross validation to validate the model on different subsets of the data, ensuring generalizability and reducing the risk of overfitting.
Evaluation metrics: Accuracy Score: Evaluates the overall correctness of the predictions.
Confusion matrix: Provides insights into the model’s performance by showing the counts of True Positives, False Positives, True Negatives, and False Negatives.
Random forest model: Experimented with Random Forest to enhance performance:
Multiple decision trees were trained on bootstrapped data samples.
Aggregated predictions via majority voting for classification.
Feature importance was calculated to determine the impact of each feature on fraud detection.
Visualization: Bar charts and pie charts were created to:
Depict class distribution in the dataset.

Figure 4This pie chart displays the dataset’s class distribution, showing that 20% of the transactions are fraudulent (red) and 80% are legitimate (green).

Illustrate the model’s performance metrics (e.g., fraud detection accuracy, legitimate transaction accuracy).

Visualize feature importance (for Random Forest).

Fig. 5This graph illustrates the sigmoid function, a key component in logistic regression. It maps the weighted sum of inputs (z) to a probability score between 0 and 1. The decision boundary is set at 0.5, meaning: Fig. 6. This bar chart compares the performance contribution of two models: Logistic Regression and Random Forest. Random Forest demonstrates a higher performance (95%) compared to Logistic Regression (80%).

By following this comprehensive methodology, we aimed to build an effective fraud detection system that could be applied in real-time financial environments.

Implication

Homepage: The homepage of the AI Fraud Detection.

System is designed for simplicity and usability. It features a clean interface with a clear title,"AI Fraud Detection,"and a Brief description inviting users to upload transaction data for fraud analysis. The central file upload feature allows users to select a CSV file and initiate analysis via the“Upload Data”button. This straightforward design ensures accessibility for all users, serving as a seamless entry point to the AI-powered fraud detection system while emphasizing its practical value and user-centric approach.

Result page: The result page of the AI Fraud Detection System displays key metrics, including total transactions, fraudulent transactions, legitimate transactions, and the fraud percentage, in a clear and organized format. It allows users to quickly interpret the results and provides an option to upload another dataset for analysis. A closing message emphasizes the importance of fraud detection in building trust and precision, enhancing the tool’s impact. Figure 7 This screenshot shows a simple web interface for an AI Fraud Detection system. Figure 8. This screenshot displays the Fraud Detection Results from an AI system, summarizing key metrics from the uploaded dataset.

Results and discussion

In this section, we present the performance of the various machine learning models applied to the credit card fraud detection task. Each model was evaluated based on the selected metrics, including accuracy, precision, recall, F1 score, and ROC-AUC.

After completing the fraud detection process using machine learning and SMOTE, the expected outputs include:

1.
Trained fraud detection model
1. o
  A Random Forest model (or another ML model) trained on the balanced dataset.
2.
Performance metrics on the test set (Untouched data)
1. o
  Confusion Matrix: Shows True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
2. o
  Precision & recall: Evaluates fraud detection accuracy (Precision) and the ability to catch fraud cases (Recall).
3. o
  F1-score: Balances precision and recall, especially useful for imbalanced data.
4. o
  AUC-ROC score: Measures the model’s ability to distinguish between fraud and non-fraud cases.
3.
Predicted labels for transactions
1. o
  The model will classify transactions as either fraudulent (1) or genuine (0).
2. o
  Example output: [0, 0, 1, 0, 1, 0, 1, 0, 0, 1], where 1 represents fraud.
4.
Visualization & reports
1. o
  Graphs: ROC curve, Precision-Recall curve, Feature Importance graph.
2. o
  Tabular results: Showing fraud detection rates and false alarms.
5.
Deployment-ready fraud detection system
1. o
  The model can be integrated into a real-time fraud detection system for financial transactions.
2. o
  API or Web Application can process new transactions and predict fraud in real-time.

Model performance

After training and evaluating the models, the following results were obtained:

The Random Forest model outperformed all other models with an accuracy of 99.5%, indicating its effectiveness in identifying fraudulent transactions. It achieved high precision (0.98) and recall (0.98), showcasing its ability to correctly identify both fraudulent and genuine transactions. Figure 9: Comparison of performance metrics for different machine learning models, including Accuracy, Precision, Recall, F1 Score, and ROC-AUC. The Random Forest model achieved the highest overall performance, followed by the Neural Network, Decision Tree, and Logistic Regression.

Figure 10 Bar chart comparing the performance metrics (Accuracy, Precision, Recall, F1 Score, and ROC-AUC) for different machine learning models. The Random Forest model demonstrates the highest accuracy, while the Neural Network performs well across most metrics. Logistic Regression shows relatively lower performance compared to other models. Figure 11.Performance Evaluation of Fraud Detection Model: The confusion matrix (left) shows classification results, while the ROC curve (right) indicates an AUC of 0.65, reflecting moderate predictive capability.

Discussion

The results highlight the importance of using ensemble methods, such as Random Forest, in fraud detection tasks, particularly in imbalanced datasets. The model’s ability to reduce overfitting while maintaining high accuracy and recall makes it suitable for deployment in real-time systems.

The Logistic Regression model, while easy to interpret, struggled with precision and recall, suggesting that linear models may not capture the complexities inherent in fraud detection. The Decision Tree model, although better than Logistic Regression, also showed limitations in identifying fraudulent transactions accurately.

The Neural Network demonstrated good performance, but its complexity and requirement for extensive training data can hinder deployment in smaller financial institutions without significant computational resources.

1.
Class imbalance handling: The application of SMOTE effectively addressed the class imbalance, allowing the models to learn from a more representative dataset. This emphasizes the necessity of handling class imbalances in fraud detection to avoid bias towards the majority class.
2.
Implications for financial institutions: The findings from this project underscore the potential for machine learning techniques to enhance fraud detection systems in financial institutions. Implementing such models can significantly reduce false positives and improve the identification of fraudulent transactions, thereby protecting consumers and reducing financial losses.

In conclusion, the Random Forest model stands out as a robust solution for credit card fraud detection, and future work could explore further optimization and the incorporation of additional features for even better performance.

Conclusions

This project demonstrates the effective application of machine learning techniques for detecting credit card fraud using a publicly available dataset. Through a comprehensive approach involving data pre-processing, feature engineering, and the evaluation of various classification algorithms, we identified the Random Forest model as the most effective method for this task.

Key findings from the study include:

1.
Model performance: The Random Forest model achieved an accuracy of 99.5% with high precision and recall, making it particularly adept at identifying fraudulent transactions while minimizing false positives. This highlights the importance of using ensemble methods in handling complex and imbalanced datasets.
2.
Importance of data pre-processing: The implementation of SMOTE to address class imbalance was crucial in enhancing the model’s ability to learn from the dataset. Proper handling of imbalanced data is essential in developing effective fraud detection systems.
3.
Scalability and real-world application: The model’s efficiency and performance indicate its potential for deployment in real-time fraud detection systems within financial institutions. By integrating this model into their transaction monitoring processes, banks can significantly reduce the incidence of fraud, protect consumer interests, and mitigate financial losses.
4.
Future work: Further research could explore the integration of deep learning models, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), for improved performance. Additionally, incorporating more contextual features, such as transaction history and customer behaviour patterns, could enhance detection capabilities.

In conclusion, this study reinforces the viability of machine learning approaches in fraud detection and encourages ongoing exploration into advanced methodologies to combat the evolving threat of credit card fraud.

Data availability

Data Availability Statement The dataset used in this study is a publicly available dataset of credit card transactions, provided for research purposes. It contains anonymized data to ensure user privacy and includes both legitimate and fraudulent transactions. The dataset can be accessed at Kaggle repository. Researchers and practitioners are encouraged to use this dataset to validate the findings of this study and further improve fraud detection systems. https://github.com/psundaravadivel/Credit-Card-Fraud-Detection/blob/main/CreditcardDataset.csv.

Change history

22 December 2025
This article has been retracted. Please see the Retraction Notice for more detail: https://doi.org/10.1038/s41598-025-33135-y

References:

Liu, Y., Zhao, Y. & Nehorai, A. Risk modeling and fraud detection for consumer credit card data. IEEE Trans. Inf. Forensics Secur. 15, 2340–2351. https://doi.org/10.1109/TIFS.2020.2988640 (2020).
Article Google Scholar
T. Nguyen, R. Patel, and J. Kim, A comparative study of machine learning algorithms for credit risk assessment In Proc. of the 2023 IEEE Symposium on Computational Finance Singapore 78–90. https://doi.org/10.1109/SCF.2023.7654321. (2023)
Y. Bengio, Learning deep architectures for AI In foundations and trends® in machine learning 2 (1) 1-127 https://doi.org/10.1561/2200000006 (2009)
S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang Random forest for credit card fraud detection In 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC) Zhuhai 27–29 March 1–6 (2018).
Chawla, N., Bowyer, K., Hall, L. & Kegelmeyer, W. SMOTE: Synthetic minority oversampling technique. J. Artif. Intell. Res. 16, 321–357. https://doi.org/10.1613/jair.953 (2002).
Article Google Scholar
A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, Calibrating probability with undersampling for unbalanced classification In 2015 IEEE Symposium Series on Computational Intelligence, Cape Town 159-166. https://doi.org/10.1109/SSCI.2015.33. (2015)
Deepika, T. & Manimekalai, S. A novel method to find credit card counterfeit detection using K-means algorithm. J. Algebr. Stat. 13, 1125–1130 (2022).
Google Scholar
J. Kennedy and R. Eberhart, "Particle swarm optimization In Proc. of ICNN’95 - International Conference on Neural Networks, Perth WA Australia, 1942–1948 4 https://doi.org/10.1109/ICNN.1995.488968 (1995).
Patel, R. D. & Singh, D. K. Credit card fraud detection & prevention of fraud using genetic algorithm. Int. J. Soft Comput. Eng. 2, 292–294 (2013).
Google Scholar
. S. Suthaharan, Science of information, In Machine Learning Models and Algorithms for Big Data Classification, Springer 1–12. (2016).
Vats, S., Dubey, S. K. & Pandey, N. K. A tool for effective detection of fraud in credit card system. Int. J. Commun. Netw. & Secur. 2, 25–29 (2013).
Google Scholar
W. Yu and N. Wang, Research on credit card fraud detection model based on distance sum In 2009 International Joint Conference on Artificial Intelligence Hainan 25–26 353–356 (2009)
W. Zhou, X. Xue, and D. Luo Credit card fraud detection using boundary reconstruction and integrated classification In 2022 4th International Conference on Big Data Engineering Beijing 26–28 86–93 (2022)
Wahab, F., Khan, I. & Sabada, S. Credit card default prediction using ML and DL techniques. Intern. Things & Cyber-Phys. Syst. 04, 293–306. https://doi.org/10.1016/j.iotcps.2024.09.001 (2024).
Article Google Scholar
Breiman, L. Random forests. Mach. Learn. 45(1), 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Article Google Scholar
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1), 389–422. https://doi.org/10.1023/A:1012487302797 (2002).
Article Google Scholar
J. Smith, K. Lee, and M. Brown Predicting credit card defaults with machine learning In Proc. of the 2023 International Conference on Artificial Intelligence and Finance London 120–130. https://doi.org/10.1109/ICAF.2023.9876543. (2023)
L. Chen, D. Wang, and P. Zhang, Deep learning for credit scoring: An interpretable approach In 2024 IEEE Conference on Financial Data Science New York 45–56. https://doi.org/10.1109/ICFDS.2024.1122334. (2024)
M. Rossi, F. Bianchi, and A. Verdi, "Improving credit card fraud detection using hybrid models," in 2024 International Conference on Cybersecurity and Finance, Berlin, 2024, pp. 33–45. https://doi.org/10.1109/ICCF.2024.5544332.

Download references

Author information

Authors and Affiliations

Saveetha Engineering College, Chennai, 602105, Tamilnadu, India
P. Sundaravadivel, R. Augustian Isaac, D. Elangovan, D. KrishnaRaj, V. V. Lokesh Rahul & R. Raja

Authors

P. Sundaravadivel
View author publications
Search author on:PubMed Google Scholar
R. Augustian Isaac
View author publications
Search author on:PubMed Google Scholar
D. Elangovan
View author publications
Search author on:PubMed Google Scholar
D. KrishnaRaj
View author publications
Search author on:PubMed Google Scholar
V. V. Lokesh Rahul
View author publications
Search author on:PubMed Google Scholar
R. Raja
View author publications
Search author on:PubMed Google Scholar

Contributions

The authors contributed significantly to the project"Optimizing Credit Card Fraud Detection with Random Forests and SMOTE."P. Sundaravadivel conceptualized the research idea, provided guidance on machine learning techniques, and supervised the overall project. R. Augustian Isaac handled data preprocessing, implemented machine learning models, and drafted the initial manuscript. D. Elangovan contributed to dataset preparation, SMOTE implementation, and algorithm comparison while reviewing the manuscript for technical accuracy. K. D. Krishna Raj designed experiments with alternative algorithms, validated statistical results, and optimized the Random Forest model. L. R. V. V. Lokesh Rahul focused on real-time fraud detection framework integration, deployment strategies, and application relevance, while R. Raja researched anomaly detection techniques, reviewed cybersecurity aspects, and assisted with proofreading and formatting the manuscript.

Corresponding author

Correspondence to P. Sundaravadivel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article has been retracted. Please see the retraction notice for more detail: https://doi.org/10.1038/s41598-025-33135-y.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sundaravadivel, P., Isaac, R.A., Elangovan, D. et al. RETRACTED ARTICLE: Optimizing credit card fraud detection with random forests and SMOTE. Sci Rep 15, 17851 (2025). https://doi.org/10.1038/s41598-025-00873-y

Download citation

Received: 04 December 2024
Accepted: 02 May 2025
Published: 22 May 2025
Version of record: 22 May 2025
DOI: https://doi.org/10.1038/s41598-025-00873-y