Introduction

Machine learning (ML) approaches are becoming increasingly common in predictive analytics applications such as churn prediction. These techniques let companies predict client attrition, spot possible threats, and create proactive customer retention plans. When ML models are trained on a representative sample of both the majority and minority classes, the balanced data enables them to discover the underlying patterns and characteristics associated with each class. By training on balanced data, ML models can predict unseen data more accurately and generalize better to real-world events.

Churn prediction involves leveraging data insights to anticipate customer churn and intervene promptly. Customer loyalty is built upon exceptional services and exceeding customer expectations. By harnessing data and utilizing advanced analytics, financial companies can strengthen customer relationships and maintain a competitive edge in the market1.

Customer churn is a key sector concern, influencing long-term profitability and client retention efforts2. Businesses that fail to foresee and control churn risk losing valued consumers. Many studies have considered strategies for accurate churn prediction to enable proactive decision-making. Existing churn prediction algorithms generally suffer from imbalanced datasets, where the dominant class (non-churned customers) greatly outnumbers the minority class (churned customers). This imbalance leads to biased models that misclassify churned clients, lowering the overall effectiveness of prediction systems3,4. Researchers have pointed out that traditional models favour the dominant class, producing good accuracy but low recall for the minority class5. Addressing this challenge demands novel ways to boost model sensitivity and ensure impartial representation of both classes. A survey identified 24 studies from 2003 to 2019 focusing on single, hybrid, and ensemble method design architectures, statistical analysis of classification algorithms, and datasets used in various research papers6.

Several studies have investigated approaches to alleviate class imbalance, particularly through data resampling and advanced machine learning algorithms. Oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and random oversampling have been widely utilized to produce synthetic minority class instances, thereby enhancing model resilience and prediction performance7, but they may not always generalize effectively across datasets. Some researchers have further refined SMOTE with additional mechanisms, such as the adaptive, density-based sample generation of the Adaptive Synthetic (ADASYN) technique. This approach has been used in credit card fraud detection and churn prediction with promising results8. Moreover, ensemble learning approaches, including bagging, boosting, and stacking, have been demonstrated to enhance churn prediction by integrating several base models to reduce bias and improve accuracy9. For instance, earlier research revealed that models combining Random Forest and XGBoost with SMOTE can reach an accuracy above 91%10. Similarly, hybrid approaches employing Support Vector Machines (SVM) and various resampling techniques have also proved effective4. Hybrid sampling in conjunction with Random Forest generated better results in a comparative analysis, underscoring the necessity of investigating diverse resampling procedures11. None of these existing approaches, however, used the balanced accuracy metric to evaluate the models.

To obtain the best results, the choice of ensemble and resampling methodology should be customized for the dataset at hand. SMOTE and other oversampling methods can increase sensitivity, but they may not always match the accuracy of models trained on the original data. Even though these techniques can enhance prediction accuracy, the inherent difficulties of class imbalance still require careful consideration and further investigation.

A successful customer retention initiative requires understanding the balance between sensitivity and accuracy. Churn prediction models should account for both sensitivity and the overall impact on business outcomes. These models can be enhanced using machine learning and advanced analytics, helping businesses market their goods, improve customer satisfaction, and strengthen the brand’s image. By leveraging these technologies, businesses can establish themselves as industry leaders.

This research focuses on enhancing churn prediction models through a combination of synthetic minority oversampling (SMOTE) and ensemble learning techniques. Unlike standard models that struggle with imbalanced datasets, the proposed technique boosts accuracy, making it appropriate for financial institutions, telecoms, and subscription-based services. For instance, early detection of high-risk clients enables banks to provide customized incentives, lowering the number of account closures. Better churn prediction algorithms can also help telecommunications companies improve customer service interactions by offering better contract conditions before churn happens. By deploying such machine learning models, firms can proactively identify high-risk consumers and adopt focused retention measures, thereby reducing revenue loss. This study improves upon earlier research by integrating SMOTE with several ensemble classifiers, assessing their performance, and identifying the most robust technique for churn prediction.

Specifically, the study assesses the influence of homogeneous ensemble classifiers (such as AdaBoost and Gradient Boosting) combined with SMOTE on prediction performance. By methodically evaluating these models against current approaches, this work offers important insights for optimizing churn prediction algorithms for practical applications.

The research contributions are as follows.

  • It offers a thorough examination of how unbalanced datasets affect the performance of several machine learning models, including single classifiers and homogeneous ensemble classifiers.

  • It shows how well SMOTE balances datasets, helping ML models better spot patterns in the minority class.

  • It integrates SMOTE with ensemble techniques like AdaBoost and Gradient Boosting to improve prediction performance and reduce overfitting.

  • It provides useful insights by thoroughly contrasting ensemble and single classifier approaches on balanced and unbalanced datasets. In the proposed work, performance is assessed using the balanced accuracy metric. Balanced accuracy is a vital criterion for assessing models on imbalanced datasets, ensuring that the majority and minority classes are adequately represented in performance evaluations.

  • It emphasizes how feature significance and hyperparameter adjustment can enhance the precision and comprehensibility of the model.

  • By locating and optimizing the best classifiers, it offers a reliable method for predicting churn and evaluating the general health of client accounts.

Related work

When customers discontinue their relationship with a business or switch to competitors, that is known as customer churn. While the specific definition of churn varies across different industries, a common example is the banking sector. Here, customers are typically classified as churned if their financial activity, such as transaction volume and account balance, drops significantly — often by 30% annually12. The high cost of losing profitable customers has shifted business focus from acquiring new clients to retaining existing ones. Customer churn is not an isolated event; it can have a ripple effect within social groups. This is because customers often influence each other’s decisions, and the departure of one customer can increase the likelihood of others leaving as well13. One potential strategy for managing the connection between customer satisfaction and loyalty involves restricting a customer’s ability to switch providers through contractual obligations14.

Businesses prioritize profitability when selecting predictive models and often employ metrics like the Maximum Profit Criterion (MPC) and Expected Maximum Profit (EMP) for this purpose. These financial measures surpass conventional performance metrics by directly assessing a model’s ability to maximize profits while retaining high-value customers15. Bogaert and Delaere showed that heterogeneous ensembles outperform alternatives in terms of Area under the Curve (AUC) and EMP9. Kurtcan and Ozcan enhanced the performance of a Support Vector Machine (SVM) model by optimizing its settings using Grey Wolf Optimization. They applied this improved SVM to predict customer churn in the telecommunications sector and compared its accuracy to other models such as logistic regression, Naive Bayes, decision trees, and a standard SVM. The findings revealed that the optimized SVM consistently produced better results than the other models tested16.

By analyzing oversampling, undersampling, and hybrid approaches on an unbalanced telecommunications dataset, Jia-Xuan et al. examined resampling techniques for customer churn prediction and found that a hybrid SMOTE-ENN approach produces an F1 score of 95.3% and an accuracy of 96.0%17. SMOTE with SVM has been used in churn prediction to handle imbalance in churn datasets4,18. The performance of SVMs is measured with the gain score; an SVM with an RBF kernel achieves a gain measure of 80% in the 20th percentile4. A combination of SMOTE and ensemble learning enhances classification performance by addressing class imbalance and improves the F1-score; different classification algorithms and voting strategies supported the efficiency of the approach19. Faruq et al. applied various classifiers, including K-Neighbors (KNN), Random Forest, XGBoost, AdaBoost, and an ensemble model, to an imbalanced Kaggle dataset; adding random oversampling produced an average accuracy of 97.31%20. SMOTE combined with a Genetic Algorithm (GA) for feature selection was applied to four different classifiers, with KNN outperforming the others at a precision and F-measure of 96%7.

Several research studies have employed various undersampling techniques to address the imbalance in telecommunication churn datasets21. An overview of sampling techniques used in churn prediction models is listed in Table 1.

Table 1 Sampling techniques employed in churn prediction.

An ensemble approach was explored by combining various classifiers, including neural networks, XGBoost, AdaBoost, KNN, SVM, random forests, and logistic regression. These combined models were compared to their counterparts on customer churn datasets. The results indicated that an ensemble composed of logistic regression, neural networks, and AdaBoost achieved the highest predictive accuracy21. SMOTE, a technique for oversampling minority classes, has been utilized in churn prediction research, as evidenced by studies22,29,30.

Jiang proposed a novel approach for feature selection in churn prediction by combining a modified multi-objective atomic orbital search with an extreme learning machine. Inspired by quantum mechanics, this method employs atomic orbital-like structures to effectively navigate the complex landscape of multi-objective optimization31. Zue et al. introduced a novel bagging-based ensemble method to enhance churn prediction accuracy on imbalanced datasets32. This approach generates diverse training subsets with varying class ratios using a cost-weighted negative binomial distribution; cost-sensitive logistic regression with a Lasso penalty is then applied to combine the predictions of multiple base classifiers. Comparative analysis with twelve state-of-the-art methods on telecommunication data demonstrated superior performance of the proposed model in terms of traditional and profit-based evaluation metrics. A heterogeneous ensemble model has been suggested for churn prediction in the telecom sector; it exploits the distinct strengths of each base classifier along with their collective knowledge, combined through a meta-learner, to enhance prediction performance33. Kate et al. explored the application of various Generative Adversarial Network (GAN) techniques to address class imbalance by oversampling the minority class in churn prediction34. Comparative analysis with other classifiers demonstrated that GAN-based oversampling significantly enhanced predictive performance.

Brito et al. employed a combined oversampling and undersampling approach to balance a financial dataset35. The ADASYN method was initially used to introduce 5% synthetic churn customers through oversampling, followed by the NearMiss6 technique to reduce the non-churn customer population by 65%. Pustokhina et al. employed a multi-objective rain optimization algorithm to determine the optimal sample size when applying the Synthetic Minority Oversampling Technique (SMOTE)27. Specifically, average account balance, customer relationship length, and transaction frequency were identified as key financial indicators of customer churn. Additionally, behavioral patterns have been recognized as crucial factors in predicting churn across various industries, including online gaming, telecommunications, and finance36. Mena et al. introduced a dynamic recency, frequency, and monetary value (RFM) approach utilizing deep learning for churn prediction in the financial sector37. While RFM remains a foundational framework, subsequent research has expanded its scope by incorporating additional variables to enhance predictive accuracy. Similarly, Smaili and Hachimi expanded the traditional RFM model by incorporating customer purchasing behaviour diversity, creating the RFM-D framework38.

Amin et al. employed a genetic algorithm in conjunction with Naive Bayes to identify the most relevant features for predicting customer churn in the telecommunications sector39. Their method demonstrated superior performance compared to other advanced techniques. Utilizing SMOTE and ADASYN for resampling, RF achieved high accuracy and ROC-AUC scores in predicting churn40. Sadeghi et al. integrated Principal Component Analysis (PCA) and Particle Swarm Optimization (PSO) with the K-means algorithm to enhance clustering performance41. By optimizing centroid initialization and reducing data dimensionality, this approach aimed to improve clustering accuracy; the developed system exhibited sensitivity and specificity scores of 75% and 99.8%, respectively. A summary of various feature selection techniques employed for customer churn prediction is presented in Table 2.

Table 2 Feature selection techniques employed in churn prediction.

Simovic et al. utilized logistic regression with a mixed penalty to mitigate overfitting47. Their findings indicate that incorporating a mixed penalty into logistic regression improved classification metrics compared to the standard logistic regression model on a CRM dataset. Recent studies have leveraged deep learning for customer churn prediction, achieving an impressive F1 score of 91% using a CNN. However, the complex nature of these models, characterized by numerous parameters and intricate layers, hinders interpretability; as a result, understanding how input data influences model outputs becomes challenging48. A recent study identified service quality, customer satisfaction, subscription plan upgrade offers, and network coverage as key factors influencing customer churn within the Danish telecommunications industry49. Consistent with other findings, Li et al. demonstrated the superior performance of the Light Gradient Boosting Machine (LightGBM) over SVM, Random Forest, and XGBoost in a financial dataset context50,51. To address the class imbalance issue, the authors50 further enhanced LightGBM by incorporating a focal loss function, which prioritizes misclassified minority class instances and difficult-to-classify samples, thereby improving the model’s ability to handle imbalanced data. The proposed model identified churners at a rate of 0.94 while producing an AUC score of 0.99.

Proposed methodology

Financial companies are increasingly utilizing data insights and advanced analytics techniques to predict churn and retain valuable customers. Exploratory data analysis (EDA) is used to analyze collected data, identifying outliers and potential variables. Predictive insights enable data-driven decisions and resource allocation to retain valuable customers. This strategy enables financial companies to foster long-term customer relationships, optimize business outcomes, and drive sustainable growth in a competitive market.

Exploratory data analysis and dataset nature

This research aims to develop a model that helps bank executives identify customers at risk of discontinuing their banking services. The study utilizes a dataset sourced from Kaggle, available at https://www.kaggle.com/shrutimechlearn/churn-modelling.

The dataset consists of 10,000 complete customer records with no missing values. It includes 14 attributes, with 7 related to personal customer details, 4 reflecting account status, and 2 providing insights into product usage. The target attribute is binary and represents the customer’s exit status.

Visualizing variable distributions

Categorical information is transformed into numerical form. The distribution of the categorical and continuous variables is depicted in Fig. 1.

Fig. 1

Categorical and continuous variable distribution.

Categorical variables such as Geography, IsMale (Gender), HasCrCard, and IsActiveMember follow a Bernoulli distribution. Continuous variables such as CreditScore and Balance are approximately normally distributed, while EstimatedSalary is uniformly distributed.

Customer characteristics

The proportion of clients by nationality and gender must be established to understand client details. Figure 2 shows the gender distribution across the nation and the proportion of clients by nationality and gender overall.

Fig. 2

Gender with demographic distribution.

According to the graph, 50% of the clients are French, with the remainder split evenly between customers from Spain and Germany. When it comes to gender, men hold somewhat more accounts than women, and this is reflected in nationwide data.

The estimated salary distribution by gender is shown in Fig. 3. The median incomes are comparable, and there isn’t much of a wage gap.

Fig. 3

Salary distribution based on gender.

There are no substantial wage variations based on age. Figure 4 illustrates that a few consumers are older citizens, with a bit of variance in salary range.

Fig. 4

Salary distribution based on age.

Correlation of factors

The correlation matrix can be used to determine the relationships between variables. The correlation map for the variables is shown in Fig. 5, and a minimal plotting sketch follows the figure. There is a strong relationship between German nationality and account balance, and both balance and age are correlated with the exit variable. Germans exhibit a higher rate of attrition.

Fig. 5

Measuring the correlation between factors.
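
A map like Fig. 5 can be reproduced in a few lines; this is a minimal sketch, assuming the dataset file name Churn_Modelling.csv and the pandas/seaborn stack (neither is specified in the text):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle churn-modelling dataset (file name assumed).
df = pd.read_csv("Churn_Modelling.csv")

# Pairwise correlations over the numeric attributes, as visualized in Fig. 5.
corr = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between factors")
plt.tight_layout()
plt.show()
```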

Relationships of attributes with a target variable

The distribution of parameters related to the target variable is depicted in Fig. 6. The Sankey diagram in Fig. 7 shows how information flows from one variable to another; the width of each flow represents the proportion of quantity flowing between the variables.

Fig. 6

Distribution of parameters related to a target variable.

Fig. 7

Sankey diagram depicting the flow of attributes with target value.

Here are some conclusions we can draw from the charts above. In terms of gender, women are less prevalent than men, but they shut accounts at a higher rate. In Germany, the percentage of customers who have left is more significant (32%, or roughly 2x higher), compared to Spain and France, where it is lower (16% each).

Customers under the age of 40 and those over the age of 65 are more likely to maintain their accounts. The decision to remain a customer at the bank is unaffected by whether one has a credit card (20% of customers in both groups have left). Compared to active customers, non-active members are more likely to stop using a bank’s service.

The initial exploratory investigation provides insights into customers and suggests a customer-focused strategy.

Building single and ensemble classifiers

Single classifiers

In machine learning, a single classifier is a model trained to carry out a particular task, such as classification, without aggregating the results of other models. The salient features of single classifiers are:

  • Independence: A single classifier functions on its own, generating predictions from the input data using its own learning method.

  • Simplicity: Compared with ensemble approaches (which combine numerous models), single classifiers are typically easier to develop and comprehend.

Single classifiers are categorized as parametric and non-parametric, as sketched in code after the list below.

  • Parametric classifiers: Require fewer data points, have a set number of parameters, and assume a certain structure for the data distribution. Naive Bayes, logistic regression, and LDA are a few examples.

  • Non-parametric classifiers: Generally, need more data points and have a variable number of parameters that increase with the data. They also do not presuppose a particular shape for the data distribution. KNN, decision trees, and non-linear SVMs are a few examples.
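
Both categories can be instantiated directly; the following is a minimal sketch, assuming the scikit-learn implementations and the groupings used later in the study (where MLP and LSVM are treated as parametric):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Parametric: fixed number of parameters, distributional assumptions.
parametric = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "LSVM": SVC(kernel="linear", probability=True),
    "MLP": MLPClassifier(max_iter=500),
}

# Non-parametric: model complexity grows with the data.
non_parametric = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),
    "KSVM": SVC(kernel="rbf", probability=True),
}
```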

Table 3 lists the classifiers included in the study.

Table 3 Single classifiers under study.

Table 4 shows the representation of the single classifiers.

Table 4 Single classifiers representation.

Algorithm 1 presents the pseudo-code of the single classifier function.


Algorithm 1 Pseudo-code of single classifier function.
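
The original pseudo-code is rendered as an image in the source; the following is a minimal sketch of what such a function might look like, assuming scikit-learn estimators and the metrics reported later in the study (the helper name run_single_classifier is hypothetical):

```python
from sklearn.base import clone
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

def run_single_classifier(clf, X_train, y_train, X_test, y_test):
    """Train one classifier and return the metrics used in the study.

    Assumes the classifier exposes predict_proba (for SVC, set probability=True).
    """
    model = clone(clf).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```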

Homogeneous classifiers

A homogeneous classifier ensemble employs several instances of the same learning algorithm to create a more powerful predictive model. Their effectiveness relies on varying the data subsets, feature subsets, or hyperparameter settings to create diversity among the classifiers. Well-known methods such as Gradient Boosting, AdaBoost, and Random Forest are instances of homogeneous classifier ensembles. Compared to individual classifiers, these ensembles enhance performance and robustness and lower the chance of overfitting.

Homogeneous classifiers’ predictions are combined using voting, boosting, and bagging techniques. Voting determines the final prediction for classification tasks, boosting combines weak learners to create strong ones, and bagging trains classifiers on bootstrap samples and averages their predictions; a brief sketch follows.
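
As an illustration of these combination strategies, here is a minimal sketch, assuming recent scikit-learn estimator APIs (the estimator keyword replaced base_estimator in newer releases); the hyperparameter values are illustrative:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Bagging: identical trees trained on bootstrap samples, predictions voted.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100)

# Boosting: weak learners (shallow trees) combined sequentially into a strong one.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

# Random Forest: bagging plus random feature subsets for extra diversity.
rf = RandomForestClassifier(n_estimators=100)
```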

Table 5 shows the homogeneous classifiers used for the study.

Table 5 Homogeneous classifiers.

Table 6 shows the homogeneous classifier representation and its description.

Table 6 Homogeneous classifiers representation.

Algorithm 2, given in the evaluation metric section below, presents the pseudo-code of homogeneous classifier functioning; Fig. 8 shows the steps involved in pre-processing.

Fig. 8

Steps involved in pre-processing.

Proposed methodology

The problem pertains to binary classification, where data is categorized into two classes. The preprocessing step removes unnecessary attributes and converts all attributes into numerical form. Subsequently, the dataset is split into training and test sets. The training set is used to train multiple supervised classifiers, generating the models, whose performance is then evaluated on the test data using appropriate metrics. This approach makes it possible to assess and compare the performance of the classifiers, facilitating informed decision-making in binary classification. The dataset consists of 20% terminated clients and 80% non-terminated clients, highlighting an imbalance in the data distribution.

The proposed flow diagram of the process is illustrated in Fig. 9.

Fig. 9

Proposed work flow diagram.

The model-building process was conducted in two stages: on imbalanced and on balanced data. With the imbalanced data, the performance of different classifiers, both single and homogeneous, was tested. The process involved in preprocessing is shown in Fig. 8.

Data preprocessing

The dataset is examined for missing values; there are none. The attributes ‘RowNumber,’ ‘CustomerId,’ and ‘Surname’ are removed from consideration as they are not relevant to the target attribute. The categorical attributes ‘Geography’ and ‘Gender’ are transformed into numerical data using one-hot encoding. Standard scaler (z-score) normalization is used to scale the data. The data is split 80:20 into training and test sets, respectively. A minimal sketch of these steps follows.
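
This is a minimal pre-processing sketch, assuming the Kaggle file Churn_Modelling.csv with its Exited target column; stratifying the split is an added assumption to preserve the 80/20 class ratio in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Churn_Modelling.csv")

# Drop identifier columns that carry no predictive signal.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# One-hot encode the categorical attributes.
df = pd.get_dummies(df, columns=["Geography", "Gender"], drop_first=True)

X = df.drop(columns=["Exited"])
y = df["Exited"]

# 80:20 train/test split (stratification is an assumption, see above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standard (z-score) scaling, fitted on the training data only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```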

Model building

In the first phase, a set of single classifiers was built and their performance assessed, comparing parametric and non-parametric classifiers. In the second phase, homogeneous classifier models were constructed and their performance recorded. Hyperparameter tuning is done using the GridSearch() method for selected single and homogeneous classifiers. The GridSearch method is a hyperparameter tuning methodology that exhaustively explores a predetermined set of hyperparameters to identify the optimal combination that maximizes model performance. With the suggested tuned parameters, classifiers are constructed, and the significance of features in the hyper-tuned classifiers is plotted. For the churn data under study, the optimized classifiers yielded appreciable performance, and the best-performing classifiers were selected for further study. A tuning sketch is given below.
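
This is a minimal tuning sketch, assuming scikit-learn's GridSearchCV stands in for the GridSearch() method; the grid values below are illustrative placeholders, with the actual grids listed in Table 8:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the actual grids used are listed in Table 8.
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="balanced_accuracy",  # the study's primary evaluation metric
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```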

Balanced dataset creation

Balanced Dataset is created using SMOTE (Synthetic Minority Over-sampling Technique). It works by generating synthetic samples of the minority class to balance the class distribution.

The steps followed in SMOTE are given below.

  1. Determine which samples in the dataset belong to the minority class.

  2. Choose a sample (x) at random from the minority class.

  3. Using Euclidean distance or a similar distance metric, find the selected sample x’s k-nearest neighbours among the minority class samples (where k is usually a user-defined value).

  4. Choose one of the k-nearest neighbours at random, say \(x_{nn}\).

  5. Create a synthetic sample \(x_{new}\) along the line segment between x and \(x_{nn}\) using the following formula, where \(\lambda\) is a random number drawn from [0, 1]:

$$x_{new} = x + \lambda \cdot (x_{nn} - x)$$

  6. Repeat the above steps until the required quantity of synthetic samples is produced to achieve equilibrium.
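
For illustration, these steps can be written out directly in NumPy; this is a hypothetical sketch (the study itself relies on a standard SMOTE implementation), and the helper name smote_sample is invented here:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=np.random.default_rng(0)):
    """Generate n_new synthetic samples from the minority-class array X_min."""
    synthetic = []
    for _ in range(n_new):
        # Step 2: pick a minority sample x at random.
        x = X_min[rng.integers(len(X_min))]
        # Step 3: its k nearest minority neighbours by Euclidean distance.
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip x itself
        # Step 4: choose one neighbour x_nn at random.
        x_nn = X_min[rng.choice(neighbours)]
        # Step 5: interpolate x_new = x + lambda * (x_nn - x), lambda in [0, 1].
        lam = rng.random()
        synthetic.append(x + lam * (x_nn - x))
    return np.array(synthetic)
```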

The balanced dataset is subjected to the selected classifiers and the metrics are compared; balanced accuracy improved after introducing the sampling technique. In this study, balancing is done only on the training set, leaving the test data untouched, to prevent bias in the trained model. A resampling sketch is given below.
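
This is a minimal resampling sketch, assuming the imbalanced-learn (imblearn) package and the X_train/y_train split from the pre-processing sketch above:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority (churned) class on the training split only,
# keeping the test set untouched.
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print(y_train.value_counts(normalize=True))      # roughly 80:20 before
print(y_train_bal.value_counts(normalize=True))  # 50:50 after
```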

Evaluation metric

The confusion matrix is a valuable tool for summarizing categorization results, providing clear insights into the model’s performance in positive and negative scenarios. Various metrics are considered to evaluate the model’s effectiveness, including accuracy, balanced accuracy, AUC, precision, recall, and F1-score. These evaluation metrics help gauge the model’s performance across different dimensions, contributing to a comprehensive assessment of its classification capabilities.

When the classes in the dataset are balanced, or when each class has about the same number of occurrences, accuracy becomes a relevant indicator. When datasets are unbalanced, accuracy can be deceptive. For instance, a model could attain high accuracy by consistently predicting the majority class if one class is substantially more frequent than the other.

When working with unbalanced datasets that include underrepresented classes, balanced accuracy is especially helpful. By accounting for recall in each class, it offers a more illuminating assessment of the model’s performance. In contrast to majority class dominance, balanced accuracy provides a more equitable evaluation of model performance across classes.

By averaging the recall (sensitivity) for each class, balanced accuracy corrects for unequal class distributions. It assigns equal weight to each class’s performance. Balanced accuracy is computed as shown in Eq. (1).

$$Balanced\ Accuracy = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$$
(1)

where \(\frac{TP}{TP+FN}\) is the recall for the positive class and \(\frac{TN}{TN+FP}\) is the recall for the negative class.
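
A small worked check of Eq. (1), assuming scikit-learn's metrics; on this deliberately imbalanced toy example, plain accuracy is 0.90 while balanced accuracy is 0.75, illustrating why the latter is preferred here:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: 8 negatives, 2 positives
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # majority-biased predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual = 0.5 * (tp / (tp + fn) + tn / (tn + fp))  # Eq. (1)

print(manual)                                   # 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.75, vs plain accuracy 0.90
```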

Algorithm 2 presents the pseudo-code of the homogeneous classifier.


Algorithm 2 Pseudo-code of homogeneous classifier

Results and discussion

Single classifier performance on imbalanced churn data

Figures 10 and 11 depict the accuracy, AUC, precision, recall, and F1-score metrics of the parametric and non-parametric classifiers under study on imbalanced churn data. Among the parametric classifiers, the Multi-Layer Perceptron (MLP) is the top-performing model, with the highest metrics in every category, including accuracy (86.52%), balanced accuracy (73.98%), AUC (85.09%), precision (88.44%), recall (95.45%), and F1 score (91.81%). Hence, MLP is the best deployment option, especially for efficiently managing unbalanced data. With strong balanced accuracy (64.65%), AUC (79.39%), and F1 score (88.99%), Naive Bayes (NB) also demonstrated strong performance, establishing itself as a dependable substitute for situations that need interpretability and simplicity. Both linear discriminant analysis (LDA) and logistic regression (LR) performed somewhat well, with strong recall but little capacity to correct for class imbalance. LSVM and QDA are considered weak models, as LSVM suffers from overfitting and QDA struggles on all metrics.

Fig. 10

Performance comparison of parametric classifiers.

Fig. 11

Performance comparison of non-parametric classifiers.

With the greatest accuracy (85.72%), balanced accuracy (68.51%), AUC (83.73%), and F1 score (91.57%), the Kernel Support Vector Machine (KSVM) performs better than the other models in the non-parametric classifier evaluation. It is also quite successful at accurately recognising positive cases while striking a good balance between precision and recall, exhibiting strong precision (85.95%) and extraordinarily high recall (97.98%). With a competitive F1 score of 89.42%, a balanced accuracy of 64.23%, and an accuracy of 82.20%, the K-Nearest Neighbours (KNN) model comes in second, performing reliably, especially in identifying positive cases (precision: 84.46%, recall: 95%). Although the Decision Tree (DT) model achieves a balanced accuracy of 67.90% and reasonable accuracy (78.92%), it performs poorly in terms of AUC (67.90%) and F1 score (86.70%), indicating a propensity for overfitting.

Figure 12 shows the error in the classification of samples. According to the misclassification rate analysis, the Multi-Layer Perceptron (MLP) makes the most accurate predictions out of all the models, achieving the lowest rate (13.48%). With misclassification rates of 14.28% and 17.8%, respectively, the Kernel Support Vector Machine (KSVM) and K-Nearest Neighbours (KNN) likewise exhibit good overall reliability. Conversely, models such as the Decision Tree (DT) and Quadratic Discriminant Analysis (QDA) show greater misclassification rates of 21.08% and 24.2%, respectively, indicating poorer predictive ability and possible overfitting problems.

Fig. 12

The error rate of single classifiers on imbalanced churn data.

It is observed that, among the parametric classifiers, the Multilayer Perceptron (MLP) performs well, while the Kernel SVM yields the better results in the non-parametric category.

Table 7 shows the accuracy and balanced accuracy scores of single classifiers.

Table 7 Accuracy and balanced accuracy scores of single classifiers.

It is evident that the balanced accuracy scores fall short of the accuracy scores, indicating bias toward the majority target value.

Homogeneous classifier performance on imbalanced churn data

Figure 13 depicts the accuracy, AUC, precision, recall, and F1-score metrics of the homogeneous classifiers under study on imbalanced churn data. Tree-based models score well on all measures, with CatBoost (CAT) and Gradient Boosting (GB) producing the best outcomes. The best-performing model was CAT, which had the highest balanced accuracy (72.78%), AUC (87.62%), and recall (96.52%), along with an outstanding F1 score (91.96%) and precision (87.82%). With strong predictive power demonstrated by high measures such as AUC (87.51%), recall (96.82%), and an F1 score of 91.96%, GB comes in second. Strong reliability was demonstrated by AdaBoost (Ada), LightGBM (LGBM), and XGBoost (XGB), which also performed well with balanced accuracy above 72% and F1 scores above 91%. Random Forest (RF) remains competitive despite somewhat lower balanced accuracy (68.95%) and AUC (83.05%). On review, CAT and GB are the most successful models, with CAT performing marginally better on the classification metrics.

Fig. 13

Performance comparison of homogenous classifiers.

Figure 14 shows the error in the classification of samples. CatBoost and Gradient Boosting are the most reliable homogeneous tree-based classifiers due to their low misclassification rates (13.36% and 13.4% respectively), while AdaBoost, LightGBM, and XGBoost have slightly higher error rates. Random Forest has the highest error rate of 15.04%.

Fig. 14

The error rate of homogeneous classifiers on imbalanced churn data.

Classifier performance after hyperparameter tuning on imbalanced churn data

The performance of the MLP and Gradient Boosting classifiers is assessed after optimizing the parameters using the GridSearch() method. Table 8 lists the parameters tuned for the classifiers.

Table 8 Hyperparameters tuned for selected classifiers.

Figure 15 depicts the AUC, precision, recall, and F1-score metrics of the selected optimized classifiers under study on imbalanced churn data. All models achieve outstanding metrics, and the performance evaluation of the optimized classifiers shows very competitive results. With the highest accuracy (86.7%), balanced accuracy (73.4%), and F1 score (91.98%), as well as strong precision (88.11%) and recall (96.21%), LightGBM (LGBM) is clearly the best-performing model, demonstrating robust overall performance. Gradient Boosting (GB) is the next most dependable choice for situations needing high sensitivity, with high accuracy (86.6%), a slightly lower balanced accuracy (72.3%), an F1 score of 91.96%, and good recall (96.82%). With accuracy of 86.4%, balanced accuracy of 72.6%, and an F1 score of 91.78%, the Multi-Layer Perceptron (MLP) likewise performs remarkably, demonstrating efficacy in both precision and recall. XGBoost (XGB), although it lags somewhat behind the others, provides competitive accuracy.

Fig. 15

Performance comparison of optimized classifiers.

The performance metric score has improved after fine-tuning the parameters. The misclassification rate of classifiers is depicted in Fig. 16.

Fig. 16

The error rate of optimized classifiers on imbalanced churn data.

It is observed that LGBM has a lower error rate compared to other selected classifiers. Table 9 shows the accuracy and balanced accuracy scores of optimized classifiers.

Table 9 Accuracy and balanced accuracy scores of optimized classifiers.

Results show that the balanced accuracy score improved after hyperparameter tuning of the classifiers. Figure 17 shows the significance of features revealed by the Gradient Boosting model. To determine feature relevance, Gradient Boosting models use the weighted average of each feature’s improvement in the loss function at each split during the tree-building process. Each feature’s significance is assessed by its contribution towards lowering the loss at each node: the more a feature helps, the higher its importance. Gradient Boosting identified key predictors of churn, including CreditScore, account balance, salary, and age. These insights enhance interpretability and guide targeted interventions.

Fig. 17

Feature importance assessment on churn data.
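
This is a minimal sketch for extracting and plotting such importances, reusing the fitted search object and feature frame X carried over from the earlier sketches (both are assumptions, not names from the original text):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Impurity/loss-based importances from the fitted Gradient Boosting model.
importances = pd.Series(search.best_estimator_.feature_importances_,
                        index=X.columns).sort_values()

importances.plot(kind="barh")
plt.xlabel("Relative importance")
plt.title("Feature importance (Gradient Boosting)")
plt.tight_layout()
plt.show()
```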

Classifier performance after hyperparameter tuning on balanced churn data

The churn dataset is balanced using the SMOTE technique. Table 10 shows the parameters tuned using the GridSearch() method. Figure 18 depicts the AUC, precision, recall, and F1-score metrics of the selected classifiers under study on balanced churn data.

Fig. 18

Performance comparison of optimized classifiers on balanced churn data.

Table 10 Hyperparameters tuned for selected classifiers.

The comparative study above selects the top-performing classifiers for further research. The performance of the boosting classifiers is appreciable; the multi-layer perceptron also performs well in churn prediction but is computationally intensive. The selected classifiers’ hyperparameters are tuned, and their performance on balanced data is investigated.

Table 11 shows the accuracy and balanced accuracy scores of optimized classifiers on balanced data.

Table 11 Accuracy and balanced accuracy scores of classifiers on balanced data.

Balanced accuracy guarantees equitable assessment for all classes in imbalanced datasets, while weighted precision, recall, and F1-score offer more dependable performance metrics by taking class distribution into account. This helps prevent inaccurate inferences from unweighted measurements or standard accuracy. Table 12 shows the influence of SMOTE across the models.

Table 12 Influence of SMOTE over the classifiers.

The results show that, across all models, SMOTE (Synthetic Minority Over-sampling Technique) increases balanced accuracy, weighted recall, and weighted F1-score, proving its value in managing class imbalance. Although logistic regression (LR) without SMOTE performs badly in terms of balanced accuracy (0.593) and weighted recall (0.373), with SMOTE these values improve markedly to 0.650 and 0.801, respectively. Gradient Boosting (GB), XGBoost (XGB), LightGBM (LGBM), and AdaBoost (ADA) typically perform better than LR, with GB + SMOTE achieving a balanced accuracy of 0.758 and ADA + SMOTE the highest at 0.777. Although weighted precision remains good across models, weighted recall is noticeably lower without SMOTE, underlining weak minority class predictions. Applying SMOTE therefore makes GB + SMOTE and ADA + SMOTE the best-performing models overall, with high balanced accuracy and F1-scores (0.849 and 0.847, respectively), hence more suited for unbalanced datasets.

Recommendation of optimal classifier for customer churn prediction

The results reveal that all classifiers attain relatively good accuracy, but balanced accuracy provides clearer insight into their true performance on an imbalanced dataset. A comparative analysis of accuracy for the boosting classifiers is shown in Fig. 19. While Logistic Regression (LR) has the lowest balanced accuracy (0.650), boosting-based models such as Gradient Boosting (GB), XGBoost (XGB), and AdaBoost (ADA) perform much better, with balanced accuracy values above 0.75, indicating superior handling of minority class predictions. The CatBoost (CAT) model achieves the best balanced accuracy (0.79), implying it is the most effective at identifying both majority and minority classes fairly. LightGBM (LGBM) behaves similarly to XGBoost, with a balanced accuracy of 0.755. Overall, boosting-based models beat logistic regression, and CatBoost appears to be the most balanced classifier in this scenario. However, if computational efficiency is a concern, AdaBoost is a strong alternative.

Fig. 19

Comparative analysis of performance of SMOTE+Classifiers.

AdaBoost, optimized with grid search, achieved an accuracy of 84.5% and balanced accuracy of 77.1%, making it the recommended model for churn prediction.

Conclusion

In conclusion, the growing number of private financial organizations has resulted in customers migrating away from traditional banking institutions, so financial organizations face challenges in retaining consumers. Monitoring customer account health has become crucial in implementing preventive retention measures. Machine learning algorithms present a promising solution for analyzing consumer status and reducing churn rates. This research conducted an extensive exploratory analysis of churn data to visualize the relationships across multiple dimensions and investigated the effectiveness of boosting, linear, and nonlinear classifiers. The dataset exhibited an imbalance in the churn/no-churn category, which was addressed by employing SMOTE to balance the training data. The classifiers were then applied to the dataset to predict the target variable. The findings revealed the bias introduced by the imbalanced data when the performance measures were compared against those obtained on the balanced data. Notably, the AdaBoost classifier demonstrated effective performance on the balanced churn data used in the study. These insights highlight the importance of addressing data imbalance and applying suitable classifiers for predicting churn in the financial domain.

Future work

Future machine learning research on imbalanced datasets should investigate sophisticated sampling strategies such as ADASYN and Borderline-SMOTE together with hyperparameter optimization to improve model performance. An ensemble of ensembles may provide higher forecast accuracy, and feature engineering and selection can increase efficiency. Further insights could be gained by looking into deep learning models such as LSTMs and RNNs, as well as cost-sensitive learning, especially for churn prediction tasks. Additionally, real-time deployment, monitoring, and enhancing model interpretability using techniques like SHAP or LIME should be prioritized. Extending these findings to other domains such as fraud detection, medical diagnosis, or anomaly detection may yield further progress in the use of balanced datasets for precise machine learning predictions.