Introduction

The subprime crisis of 2008 triggered the most profound post-war recession, catching most policymakers and the financial community by surprise and bringing commercial banks’ financing and investment decisions into sharp focus (Cukierman 2019; Zubair et al. 2020). One of its primary causes was a flawed mortgage model that incentivised laxity among the financial entities granting loans. Aware that they would not retain the mortgages but sell them on to other entities, these institutions showed a lack of diligence in verifying applicant information (Hubbard and Navarro 2010). This resulted in the emergence of subprime mortgages, high-risk loans extended to borrowers who were unable to demonstrate the credit scores or monthly income level corresponding to the requested credit amount (Jones and Sirmans 2019).

Consequently, over the past two decades, financial institutions have prioritised more effective decision-making methods for risk assessment, with the goal of improving the accuracy of business failure forecasts and creditworthiness estimates. Thus, machine learning (ML) approaches have emerged as crucial tools for the financial sector (Leo et al. 2019; Quan and Sun 2024), for example in building automated bank fragility prediction models (Shang et al. 2024) and credit scoring models from large datasets. These models aim to accurately classify cases as “good” or “bad” based on solvency ratios or estimated payment capacities (Bücker et al. 2022; Chen et al. 2016; Hussin-Adam-Khatir and Bee 2022; Nalić and Martinovic 2020).

Nevertheless, in practical applications of ML classification for financial risk management, an additional challenge called the class imbalance problem must be addressed (Niu et al. 2020; Shen et al. 2019): most financial datasets contain vastly more solvent examples (majority class) than insolvent examples (minority class) and are therefore often strongly imbalanced (Hussin-Adam-Khatir and Bee 2022; Kennedy et al. 2010). As a result, classification results tend to be skewed towards the majority class (Shen et al. 2019), leading to poor performance of classifiers in identifying examples of the minority class (Niu et al. 2020; Wang et al. 2015). Notably, misclassifying an insolvent case as solvent incurs a higher cost in risk management than missing out on an opportunity (Hussin-Adam-Khatir and Bee 2022; Shen et al. 2019).

On the other hand, ML approaches for financial assessment must maintain a balance between accuracy and interpretability (Florez-Lopez and Ramon-Jeronimo 2015; Hayashi 2016). Interpretability refers to a model’s ability to provide information in a human-comprehensible form. This aspect holds significance because of both commercial and legal considerations, where financial managers need to understand the information received to combine it with their expert judgement for more accurate financial risk evaluation. Additionally, it is crucial for scenarios such as explaining to an applicant why their credit request was rejected (Bücker et al. 2022; Florez-Lopez and Ramon-Jeronimo 2015; Hayashi 2016). Similarly, in adherence to banking regulations and audit requirements in many countries, financial institutions are required to justify their decisions regarding accepting or denying finance requests (Bücker et al. 2022; Florez-Lopez and Ramon-Jeronimo 2015; Hayashi 2016; Tomczak and Zięba 2015).

In this sense, symbolic algorithms based on decision trees (DTs) and rule systems (Apté and Weiss 1997) in the final form of IF-THEN statements are the most commonly used methods for building expressive and human-readable representations of knowledge (Wu and Hsu 2012). Unlike neural networks (NNs) and support vector machines, which do not provide insight into how they generate their predictions (i.e., they are black box methods) (Lantz 2013), rule solutions can be adequately incorporated into decision-making processes requiring the utmost clarity (Wu and Hsu 2012) and are directly applicable in contexts such as financial risk management (Florez-Lopez and Ramon-Jeronimo 2015). However, the interpretability and conciseness of extracted rules pose a critical compromise (Hayashi 2016); a large set of rules or a higher average number of antecedents per rule results in more complex and less concise rules, diminishing interpretability and the ability to generalise from observed data to unseen data (a phenomenon known as overfitting) (Hayashi and Oishi 2018; Ying 2019). Hence, the simplification of extracted rules becomes crucial for enhancing interpretability in the decision-making process (Cano et al. 2011; Gacto et al. 2011; Lanzarini et al. 2017; Hayashi 2016), reducing the effort needed to understand their meaning (Gacto et al. 2011).

From this perspective, numerous research efforts have been undertaken to assess financial risk using strategies that address the adverse effect of class imbalance on the predictive power of ML approaches; however, overall performance considering the trade-off between accuracy and interpretability has not been sufficiently addressed (Chen et al. 2024). Therefore, we present an overview of recent ML studies that have addressed the research gap in dealing with class imbalance problems in financial data and that integrate rule solutions to improve the interpretability of the models.

Literature review

Many studies have developed ensemble methods by training multiple models and combining their predictions to improve performance in classifying financial risks from imbalanced datasets (He et al. 2018; Xia et al. 2020; Zhang et al. 2018). For example, Florez-Lopez and Ramon-Jeronimo (2015) proposed an ensemble model based on DTs as base learners, creating a correlated-adjusted decision forest (CADF) univariate to yield an accurate and comprehensible classification model for credit risk evaluation. The ensemble strategy involved merging four individual DT models from a single dataset. Feature and instance diversity were included via different wrapper-feature selection processes for each inductive model. A 10-fold cross-validation (where the dataset is randomly split into 10 mutually exclusive equal subsets for 10 training and testing sessions) was employed for DT construction. Additionally, bootstrapping (by sampling n instances uniformly from the data with replacement for training and using the remaining instances for testing) was implemented for out-of-sample validation. A penalty function was also introduced to generate adjusted-weighted votes using a mixed accuracy-correlation ranking scheme. CADF univariate was tested on the German credit risk dataset from the UCI repository. Comparative evaluations of predictive accuracy and interpretability against each DT classifier used to build the ensemble model, and other decision forest strategies revealed that CADF univariate outperforms any single classifier in terms of out-of-sample accuracy and emerged as the most interpretable among complex decision forest models.

Similarly, Hayashi et al. (2016) proposed another ensemble approach to increase the conciseness of extracted rules for automated financial risk models. They employed a recursive ML algorithm to extract classification rules (Re-RX) from a feedforward NN (Setiono et al. 2008), replacing the C4.5 DT classifier (Quinlan 1993) with the unique variant J48graft from the WEKA workbench (Panigrahi and Borah 2018). Experiments were conducted in six two-class financial datasets, which considered discrete variables before analysing continuous data. Re-RX with J48graft resulted in a smaller and more accurate set of extracted rules than the Re-RX algorithm for all the databases. Later, Hayashi extended his ensemble strategy by including the selection of continuous attributes (Continuous Re-RX) and sampling selection techniques (Sampling Re-RX) to achieve higher accuracy and interpretability (Hayashi 2016). The effectiveness and appropriateness of the Re-RX family algorithm were assessed in four real-life, two-class mixed (discrete and continuous attributes) financial datasets. The findings suggest that Continuous Re-RX, Re-RX with J48graft, and Sampling Re-RX comprised powerful management tools for creating accurate, concise, and interpretable decision support systems for financial risk analysis.

As an extension, Hayashi and Oishi (2018) proposed a straightforward two-stage sequential ensemble classifier to achieve a well-balanced rule extraction method that prioritises high accuracy while generating a concise number of rules. This approach employs a backpropagation NN alongside Continuous Re-RX with J48graft via recursive feedback. Experiments were performed on two mixed financial datasets, demonstrating that the proposed ensemble method represented the best trade-off solution, offering both accuracy and interpretability.

Lanzarini et al. (2017) employed a hybrid ML approach based on a competitive learning vector quantisation (LVQ) NN (Kohonen et al. 2001) with particle swarm optimisation (PSO) (Wang et al. 2007) to establish classification rules. The method’s performance was tested using a real consumer credit dataset against the C4.5 and PART (Witten et al. 2011) classifier algorithms. Across ten independent runs, the LVQ network of 30 neurons + PSO showed a slightly lower accuracy than benchmark models but with a significantly lower average number of rules and antecedents.

Conversely, Wu and Hsu (2012) proposed an enhanced decision support model (EDSM) incorporating a relevance vector machine (RVM) (Tipping 2000) and DT for analysing financial rating domains. Their ensemble strategy employed a de-compositional approach that deconstructed the RVM structure to obtain relevance vectors and predicted labels based on the best cross-validation result. These outputs were then fed into a rule-based classifier with explanatory capabilities, enabling it to understand and leverage the insights from the RVM. According to the results of the final rankings, the EDSM outperformed six other classifiers in forecasting solvent rating status using real data from the Taiwan financial system.

Similarly, another study used a Taiwanese customer dataset and proposed the SPR-RIPPER hybrid model (Xu et al. 2018). This model combines three components: the RELIEF method (Kira and Rendell 1992) to remove redundant features and enhance model interpretability; the synthetic minority class oversampling technique (SMOTE) (Chawla et al. 2002) to address the class imbalance by resampling minority classes with an increase of 100% in the training dataset; and the RIPPER algorithm (Cohen 1995) for rule extraction. SPR-RIPPER exhibited higher precision, lower time complexity, and fewer rules extracted than base symbolic classifiers such as C4.5 and RIPPER.

In the context of financial distress prediction, Kristóf and Virág (2022) evaluated ML approaches for reliably predicting the failure risk of the central banks of the 27 countries in the European Union. They applied logit, C5.0 DT (Boosting C4.5) (Quinlan 1996), and a deep learning NN method to 32,287 bank-year observations. The C5.0 generated 100 different rule sets from 100 boosted DTs with depth values between 4 and 12. Notably, the ensemble DT method outperformed the other ML techniques in terms of predictive power.

Table 1 summarises the findings of the studies mentioned above regarding accuracy, interpretability, and dataset characteristics used for evaluating the performance of the proposed approaches. Additionally, we computed a complexity value (Eq. 1) for a rule-based classifier introduced by Nauck (2002), who proposed an interpretability measure that associates the number of classes and antecedents per rule,

$${complexity}=\frac{m}{{\sum }_{i=1}^{r}{c}_{i}}$$
(1)

where \(m\) is the number of classes, \(r\) is the number of rules, and \({c}_{i}\) is the number of antecedents used in the \(i\)-\({th}\) rule. Therefore, a higher value indicates a less complex and more comprehensible rule system since the classifier contains fewer rules and antecedents (Cano et al. 2011).
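As a minimal sketch of Eq. 1, the complexity value can be computed from the number of classes and the antecedent count of each rule. The function name and the example rule systems below are illustrative assumptions, not values taken from Table 1.

```python
def complexity(num_classes, antecedents_per_rule):
    """Complexity value of Eq. 1: m divided by the total number of antecedents.

    num_classes          -- m, the number of classes
    antecedents_per_rule -- list with c_i, the antecedent count of each rule
    """
    total_antecedents = sum(antecedents_per_rule)
    if total_antecedents == 0:
        # Eq. 1 is undefined for a rule system without any antecedent
        raise ValueError("rule system has no antecedents")
    return num_classes / total_antecedents


# Hypothetical two-class rule systems: a compact one (two rules plus a
# default rule without antecedents) and a larger one with four rules.
compact = complexity(2, [3, 2, 0])
large = complexity(2, [4, 5, 3, 6])
print(compact, large)
```

The compact system yields the higher value, in line with the reading that higher complexity values indicate a more comprehensible rule system.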

Table 1 Results and datasets of previous studies on ML for financial risk management.

The experimental results of Chen et al. (2024) showed that class imbalance has a negative effect on the interpretation performance of ML approaches, which is consistent with our review, since for datasets with higher imbalance rates (EU-27 banks and Taiwan; see Table 1), the comprehensibility of models (complexity value) also decreased.

Therefore, we propose a novel approach to rule extraction for financial risk management that addresses the class imbalance problem with a high imbalance ratio (> 10) while considering the trade-off between accuracy and interpretability.

Methods

Financial dataset

In this study, we used the publicly available US database provided by the Federal Deposit Insurance Corporation (Serrano-Cinca and Gutiérrez-Nieto 2013). This strongly imbalanced dataset (Marqués et al. 2013) consists of financial accounting statements from 8292 banks segmented into 319 insolvent and 7973 solvent cases, resulting in an imbalance ratio of 24.99 (7973/319). The dataset considers 17 financial ratios as independent variables, aiming to cover the key indicators for diagnosing banking financial health. Table 2 presents each ratio along with its definition.

Table 2 Financial ratios employed for banking distress analysis. Acronyms and definitions are taken from the Federal Deposit Insurance Corporation.

REMED

REMED is a symbolic one-class approach for binary classification introduced by Mena and Gonzalez (2009). The algorithm is designed to address two key aspects: 1) handling the class imbalance problem by constructing biased models that aim to recognise a target class while training on both classes and 2) producing interpretable and concise rule-based systems that comprise only one rule with m antecedents for predicting the target class and another default rule without antecedents for predicting the default class. To achieve both aims, the REMED learning process unfolds in three main stages: 1) selection of antecedents, 2) selection of initial partitions, and 3) building the rule system.

Selection of antecedents

In the first stage (Algorithm 1), REMED estimates the probability \(p\) of an independent continuous variable \(X\) being associated with the target class using simple logistic regression (Hosmer and Lemeshow 1989). Since the outcome is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a logit transformation (Armitage et al. 2008) is applied to the odds, i.e., the probability of success divided by the probability of failure. The logistic function (Eq. 2) is represented by:

$$p=\frac{1}{1+{e}^{-({\beta }_{0}+{\beta }_{1}X)}}$$
(2)

where coefficients \({\beta }_{0}\) and \({\beta }_{1}\) are estimated for each variable \(X\) using the maximum likelihood function (Aldrich 1997). The odds ratio associated with \(X\) is \({e}^{{\beta }_{1}}\). Thus, an odds ratio of 1 indicates non-association, a ratio greater than 1 indicates a positive association (where an increase in \(X\) leads to an increase in \(p\)), and a ratio less than 1 indicates a negative association (where a decrease in \(X\) leads to an increase in \(p\)). \(X\) is considered a rule antecedent only if its association is statistically significant at a confidence level > 99%. Depending on the previously established association (positive or negative), it is possible to determine the relational operator (≥ or ≤) used to partition \(X\) within the feature space.

Algorithm 1

Selection of antecedents.

Selection of Antecedents (dataset, variables)

antecedents ← ∅

confidence_level ← 1 − α // > 99%

ε ← 1/10^k // convergence level

FOR x ∈ variables DO

X […] ← dataset[x] // instances for each continuous variable x

p, odds_ratio ← Logistic Regression (X […], ε)

IF p < (1 – confidence_level) THEN

antecedents ← antecedents ∪ {(x, odds_ratio)}

END-IF

END-FOR
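The selection stage can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the REMED implementation: the fit uses Newton-Raphson, significance is judged with a two-sided Wald test on \({\beta }_{1}\) (the text does not state which test is used), and the names `simple_logistic` and `select_antecedents` are hypothetical. The fit assumes the two classes overlap on the variable; perfectly separated data would not converge.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function (Eq. 2)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simple_logistic(x, y, eps=1e-8, max_iter=100):
    """Fit p = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns (p_value, odds_ratio), where p_value comes from a Wald test
    on b1 (an assumption of this sketch) and odds_ratio = exp(b1).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. b0
            g1 += (yi - p) * xi       # gradient w.r.t. b1
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < eps:   # convergence level
            break
    se_b1 = math.sqrt(h00 / det)          # standard error of b1
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return p_value, math.exp(b1)


def select_antecedents(dataset, labels, confidence_level=0.99):
    """Keep variables whose association with the target class is significant."""
    antecedents = {}
    for name, column in dataset.items():
        p_value, odds_ratio = simple_logistic(column, labels)
        if p_value < 1.0 - confidence_level:
            antecedents[name] = odds_ratio
    return antecedents
```

Here `dataset` is assumed to map each variable name to a list of values and `labels` to hold 1 for the target class and 0 otherwise; the returned odds ratio then tells the next stage whether to use ≥ (ratio above 1) or ≤ (ratio below 1).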

Selection of initial partitions

Symbolic classifier learning is based on a DT scheme that divides the feature space from the top (root) to the bottom (leaf) until each instance is assigned to a unique class. Consequently, partitions are a set of exhaustive and mutually exclusive conditions for building a symbolic rule: they classify all instances (exhaustive) and assign each of them to only one class (exclusive). In this phase, REMED begins by sorting the values of an antecedent \(X\) in ascending order, computing its mean value \(\bar{X}\), and then moving from the value closest to \(\bar{X}\) towards either the start (negative association) or the end (positive association) of \(X\left[\ldots \right]\), based on the odds ratio, to find the closest value to \(\bar{X}\) that belongs to the target class. Subsequently, REMED computes the average between the selected \(X\) value and its predecessor (in the case of positive association) or successor (in the case of negative association). This new estimation is performed only once for each antecedent because further displacements to calculate a new partition could leave at least one target class instance on the opposite side of the classification threshold, thereby reducing the probability of belonging to the target class. Algorithm 2 provides a detailed step-by-step description.

Algorithm 2

Selection of initial partitions.

Selection of Initial Partitions (dataset, antecedents)

partitions ← ∅

FOR x, odds_ratio ∈ antecedents DO

X […] ← Sort Ascending (dataset[x]) // instances sorted ascending for each antecedent x

part ← Average (X […]) // mean of the values of X

pointer ← Position (X […], part) // searching X value closest to \(\bar{X}\)

k ← pointer

WHILE X [k].class ≠ target class // searching the closest value to \(\bar{X}\) belonging to the target class

IF odds_ratio > 1 THEN

k ← k + 1 // positive association

ELSE

k ← k – 1 // negative association

END-IF

END-WHILE

IF pointer ≠ k THEN

IF odds_ratio > 1 THEN

part ← (X [k].value + X [k – 1].value) / 2 // positive association

ELSE

part ← (X [k].value + X [k + 1].value) / 2 // negative association

END-IF

END-IF

partitions ← partitions ∪ {part} // set of initial partitions

END-FOR
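The partition search for a single antecedent can be sketched as below. The function name `initial_partition` is hypothetical, and the guard for the case in which no target class instance is found in the search direction is an assumption of this sketch (the pseudocode does not cover it); in that case the mean is kept.

```python
def initial_partition(values, labels, odds_ratio, target=1):
    """Initial threshold for one antecedent, following Algorithm 2."""
    pairs = sorted(zip(values, labels))            # sort instances by value
    mean = sum(v for v, _ in pairs) / len(pairs)   # initial partition: the mean
    # pointer: position of the value closest to the mean
    pointer = min(range(len(pairs)), key=lambda i: abs(pairs[i][0] - mean))
    step = 1 if odds_ratio > 1 else -1             # positive: move right, negative: left
    k = pointer
    while 0 <= k < len(pairs) and pairs[k][1] != target:
        k += step
    if not 0 <= k < len(pairs):
        return mean        # assumption: no target instance found in that direction
    if k != pointer:
        # average the selected value with its predecessor (positive association)
        # or its successor (negative association)
        return (pairs[k][0] + pairs[k - step][0]) / 2
    return mean


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]            # target class at the high end
print(initial_partition(values, labels, odds_ratio=2.0))
```

In this toy example the mean is 5.5, the first target class value met when moving right is 8, and the threshold is placed halfway between 7 and 8.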

Building the rule system

After generating the initial partitions for each of the m antecedents, REMED builds a straightforward rule system with m conditions as follows:

IF 1 <relation> part_1 AND … AND i <relation> part_i AND … AND m <relation> part_m THEN => target class

ELSE => default class

where <relation> is either ≥ or ≤, depending on whether antecedent i is positively or negatively associated with the target class, and part_i is the classification threshold.

In the final stage (Algorithm 3), REMED classifies instances using the initial system of rules. Subsequently, it aims to improve predictive performance by adjusting classification thresholds using the bisection method (Burden and Faires 2000). Initially, REMED defines the first searching interval for potential new partitions based on partition \(i\) and the maximum or minimum value (depending on association) for instances of antecedent \(i\). The algorithm then builds a temporary rule system by modifying the current partition \(i\) with the new partition value, classifying the instances again, and retaining the new partition only if it decreases the number of incorrectly classified default class examples without decreasing the number of correctly classified target class examples. This step is repeated for each antecedent until the established convergence level is reached or when the current rule system no longer reduces the number of incorrectly classified default class instances. A detailed outline divided into instruction blocks (A-E) is as follows:

(A) REMED begins by constructing an initial rule system based on the current partitions, classifying instances, and then saving the results. REMED also saves the number of correctly classified (k1) and predicted (k2) target class instances.

(B) Later, REMED begins an iterative process \((1\ldots m)\) to improve the predictive value of each partition. It estimates a new partition for antecedent \(i\) by averaging its initial classification threshold with the maximum or minimum value of the examples for this antecedent (based on association). REMED saves a copy in the copy_partitions vector to evaluate the classification performance of the new partition.

(C) REMED creates a temporary rule system by changing the current partition of antecedent i with the new partition and classifies examples again. REMED saves the number of correctly classified (k3) and predicted (k4) target class instances.

(D) REMED compares the results obtained with the new classifier. If the number of correctly classified target class examples decreases (k3 < k1), then REMED sets the current partition as the maximum interval value to estimate a new partition; otherwise, if the number of incorrectly classified default class examples decreases (k4 < k2), then REMED saves the number of predicted target class examples (k5) and sets the current partition as the minimum interval value to estimate a new partition. REMED continues estimating new partitions for antecedent i using the bisection method until the absolute difference between the maximum and minimum interval values no longer exceeds the established convergence level or until the current rule system no longer decreases the number of incorrectly classified default class instances.

(E) If the new partition for antecedent \(i\) improves the predictive value, it is included in the final set of partitions, and the number of predicted target class instances is updated (k2 = k5). This process is repeated for all m antecedents.
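Blocks (A) to (E) can be sketched as follows. This is one reading of the description above, not the published Algorithm 3: rules are represented as hypothetical `(feature_index, operator, threshold)` tuples, labels use 1 for the target class, and the search interval is expressed through a "loose" end (the last accepted threshold) and a "strict" end (the extreme value of the antecedent), which covers both the ≥ and ≤ cases.

```python
def classify(rows, rules):
    """Predict 1 (target class) when every antecedent condition holds, else 0."""
    return [
        1 if all((r[i] >= t) if op == ">=" else (r[i] <= t) for i, op, t in rules) else 0
        for r in rows
    ]


def hits_and_predicted(rows, labels, rules):
    """Return (correctly classified target instances, predicted target instances)."""
    pred = classify(rows, rules)
    hits = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    return hits, sum(pred)


def refine_partitions(rows, labels, rules, eps=1e-4):
    rules = list(rules)
    k1, k2 = hits_and_predicted(rows, labels, rules)       # block A
    for j, (i, op, part) in enumerate(rules):              # block B
        column = [r[i] for r in rows]
        strict = max(column) if op == ">=" else min(column)
        loose = part
        while abs(strict - loose) > eps:                   # convergence level
            new = (loose + strict) / 2                     # bisection step
            trial = rules[:j] + [(i, op, new)] + rules[j + 1:]   # block C
            k3, k4 = hits_and_predicted(rows, labels, trial)
            if k3 < k1:            # block D: target instances lost, back off
                strict = new
            elif k4 < k2:          # fewer default instances predicted as target
                loose, k2 = new, k4
                rules = trial      # block E: keep the improved partition
            else:
                break              # no further improvement for this antecedent
    return rules


rows = [[1], [2], [3], [4], [5], [6], [7], [8]]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
print(refine_partitions(rows, labels, [(0, ">=", 2.5)]))
```

In this toy run the threshold moves from 2.5 to 5.25, which removes the three default class instances that were predicted as target without losing any target class instance.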

Algorithm 3

Building the rule system.

Therefore, the main goal of REMED is to maximise the classification performance of the target class. First, it selects only antecedents strongly associated with the target class (logistic regression confidence level > 99%), which is the only parameter required. Second, the search for partition thresholds stops upon encountering the first target class example, to prevent a decrease in the probability of belonging to the target class. Finally, REMED tries to improve the predictive performance of the rule system without compromising the number of correctly classified target class examples.

J48

J48 is a Java-implemented version of C4.5 (Quinlan 1993), available through the WEKA workbench. C4.5 is a widely recognised symbolic classifier used for financial risk management problems (Martens et al. 2007; Cano et al. 2011; Brown and Mues 2012; Florez-Lopez and Ramon-Jeronimo 2015; Tomczak and Zięba 2015; Hayashi 2016; Hayashi et al. 2016; Lanzarini et al. 2017; Hayashi and Oishi 2018; Xu et al. 2018). C4.5 is a discrimination-based approach that can handle multi-class problems, generating a DT with class membership predictions for all the instances. The tree-building process employs a partition selection criterion known as information gain, an entropy-based metric that measures the purity between a partition and its sub-partitions. Employing a recursive procedure, C4.5 selects antecedents that lead to purer child nodes (where a completely pure node includes examples belonging solely to one class) at each step. The gain ratio assists in identifying the best target antecedent. After the DT is built, C4.5 applies a pruning strategy to prevent overfitting (Bramer 2002).

JRip

JRip is a Java-based version of the RIPPER algorithm (Cohen 1995), also provided by the WEKA workbench. RIPPER is another symbolic classifier widely used in the early detection of financial risk that induces sets of classification rules (Cano et al. 2011; Sánchez-Garreta et al. 2012; Tomczak and Zięba 2015; Berka 2016; Obermann and Waack 2016; Xu et al. 2018; Otieno et al. 2020). Although RIPPER can handle multi-class problems, its learning process for binary classification tasks is particularly interesting. RIPPER employs a divide-and-conquer approach to iteratively build rules to cover previously uncovered training examples (generally target class examples) in growing and pruning sets. Rules are grown by adding conditions until each rule encompasses only a single example in the growing set, often from the default class. Thus, RIPPER usually generates rules starting from the target class and extending to the default class, offering an efficient method of learning rules specifically for the target class.

Performance assessment

To assess the accuracy performance of the rule-based systems generated by each symbolic classifier, we constructed a confusion matrix containing the predicted and actual values for binary classification (Table 3), where TP represents the number of true positives (instances correctly predicted as good), FP represents false positives (bad predicted as good), FN represents false negatives (good predicted as bad), and TN represents true negatives (bad predicted as bad).

Table 3 Confusion matrix for binary classification.

Most studies dealing with imbalanced data typically denote the target class (minority) as positive and the default class (majority) as negative. However, in the financial context, labelling bad cases as positive and good cases as negative may seem unreasonable (Tomczak and Zięba 2015). Consequently, y = 1 usually denotes a good or solvent case, and y = 0 denotes a bad or insolvent case.

Another critical issue when dealing with class imbalance is how to satisfactorily assess classifier performance since using the typical accuracy rate (AR) (Eq. 3) can lead to misleading conclusions, as it measures only the overall percentage of correctly classified instances but does not evaluate the predictive performance for each class:

$${AR}=\frac{{TP}+{TN}}{{TP}+{FP}+{FN}+{TN}}$$
(3)

Therefore, it is also necessary to determine the appropriate way to evaluate classifiers on datasets with uneven distributions; thus, we also used Precision measure (Eq. 4) to assess the predictive value for identifying solvent cases:

$${Precision}=\frac{{TP}}{{TP}+{FP}}$$
(4)

Moreover, the geometric mean (Gmean) (Eq. 5) serves as a performance criterion to mitigate the impact of unbalanced data (Tomczak and Zięba 2015). It is considered a balancing measure between the correct classification of the positive and negative classes when evaluated separately (Tomczak and Zięba 2015):

$${Gmean}=\sqrt{\frac{{TP}}{{TP}+{FN}}\times \frac{{TN}}{{TN}+{FP}}}$$
(5)

Another weakness of AR is that it ignores the misclassification cost (bad classified as good, or good classified as bad), which is an important issue in financial risk assessment. In this context, a binary classifier can incur two types of errors: Type I error (FP), when an insolvent case is classified as solvent, and Type II error (FN), when a solvent case is classified as insolvent. Since the former can imply a loss of capital and the latter only a missed business opportunity, the cost of Type I error is typically considered higher for financial institutions. Therefore, we also included the Type I error or the FP rate (Eq. 6) as another metric to assess the accuracy performance of the symbolic classifiers.

$${Type\,Ierror}=\frac{{FP}}{{FP}+{TN}}$$
(6)
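Equations 3 to 6 can be computed directly from the four cells of the confusion matrix, as in the short sketch below. The function name and the example counts are illustrative assumptions and do not correspond to the values in Table 4; following the convention above, the positive class is the solvent (good) class.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Accuracy metrics of Eqs. 3-6 from a binary confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "AR": (tp + tn) / total,                                     # Eq. 3
        "Precision": tp / (tp + fp),                                 # Eq. 4
        "Gmean": math.sqrt(tp / (tp + fn) * tn / (tn + fp)),         # Eq. 5
        "TypeI": fp / (fp + tn),                                     # Eq. 6
    }


# Hypothetical counts: 100 solvent and 20 insolvent cases
for name, value in classification_metrics(tp=90, fp=5, fn=10, tn=15).items():
    print(f"{name}: {value:.3f}")
```

With these counts AR is 0.875, yet a quarter of the insolvent cases are classified as solvent (Type I error of 0.25), which illustrates why AR alone can be misleading on imbalanced data.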

With respect to interpretability performance, we used criteria aimed at increasing the simplicity of the rule system. Following the principle of Occam’s razor, the best model constructs solutions with the smallest possible set of elements (Gacto et al. 2011). Thus, we computed complexity values (Eq. 1) as an interpretability measure related to the number of rules and antecedents per rule yielded by each symbolic classifier.

To prevent potential overfitting, we conducted a 10-fold cross-validation repeated five times with different seeds for all the classifiers. For the statistical comparison of the ML models, we used the nonparametric Friedman test and the Nemenyi post hoc test to determine significant differences among multiple classifiers and the Bonferroni–Dunn test to compare the performance of alternative approaches (J48 and JRip) with that of REMED as a control classifier (Demšar 2006). We ranked the k symbolic classifiers according to the best-performing algorithm in terms of the N metrics used for assessing both accuracy and interpretability. A trade-off rank average was computed for each classifier, and a significance level of p < 0.05 was set for significant differences among the performances of all symbolic classifiers and pairwise comparisons with REMED.

Results and Discussion

The experimental results are summarised in Tables 4–7 and Figs. 1 and 2. Table 4 presents the confusion matrix for each classifier, allowing the calculation of their respective performance metrics. Additionally, Table 5 shows the rule systems yielded by REMED and JRip for estimating interpretability measures. Due to the large size of the DT generated by J48, only the number of antecedents per rule is indicated. REMED, as anticipated, only selected attributes significantly associated with the target class (p < 0.0001), that is, with the highest confidence level (> 99.99%). Consequently, it generated a more concise rule system using fewer antecedents (12 in total) for determining a client’s insolvency and achieved the best complexity indicator, unlike JRip and J48, which yielded multiple rules (J48 an excessive number). Furthermore, as REMED addresses the issue of imbalanced classes, its rule system tends to identify most instances in the insolvent class.

Table 4 Confusion matrix of the classifiers.
Table 5 The rule system of the symbolic classifiers.
Table 6 Individual performance and rank of the classifiers for each assessment metric.
Table 7 Statistical comparisons of classifiers.
Fig. 1 Bar plot comparing the classification performance of the classifiers.

Fig. 2 Bar plot comparing the interpretability performance of the classifiers.

Figures 1 and 2 compare the classifiers in terms of accuracy and interpretability, highlighting the superior performance of each approach concerning specific measures. A logarithmic scale was used to present the results more compactly for interpretability metrics. Notably, REMED is the best classifier in terms of Gmean, Precision and Type I error, which are three widely recognised metrics for assessing performance in class imbalance issues. An exception arises in AR, where REMED displays the lowest indicator among the ML models. Nonetheless, this behaviour is common, and the literature reports that solution proposals with superior performance in unbalanced classes may exhibit a decline in AR rates (Chao et al. 2022).

Table 6 shows the results of ranking the classifiers for each metric separately. Rank 1 represents the best-performing approach, rank 2 represents the second-best approach, and rank 3 represents the worst-performing approach. Higher values indicate better performance for accuracy metrics, except for Type I error (insolvent classified as solvent). In the case of interpretability measures, the best values are the lowest number of rules and average antecedents, along with the highest complexity indicator. The average rank was calculated by considering the trade-off between accuracy and interpretability.

Table 7 reports the results of the statistical tests used to detect significant differences among the average ranks of classifiers. The nonparametric Friedman test was applied to rank the k = 3 classifiers over the N = 7 metrics, yielding a statistic = 4.492 distributed according to the F distribution with degrees of freedom k − 1 = 2 and (k − 1)(N − 1) = 12. The critical value of F(2,12) for a significance level of p = 0.05 was 3.885. Therefore, we rejected the null-hypothesis that all the classifiers were equivalent.
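The reported statistic can be checked from the average ranks alone. The sketch below assumes that the F-distributed statistic is the Iman-Davenport correction of the Friedman chi-square described by Demšar (2006); the average ranks (1.429, 1.857 and 2.714) are the rounded values quoted in the text for REMED, JRip and J48, and the function names are hypothetical.

```python
def friedman_chi2(avg_ranks, n):
    """Friedman chi-square statistic from the average ranks of k classifiers over n measurements."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def iman_davenport_f(avg_ranks, n):
    """F-distributed statistic with k - 1 and (k - 1)(n - 1) degrees of freedom."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, n)
    return (n - 1) * chi2 / (n * (k - 1) - chi2)


avg_ranks = [1.429, 1.857, 2.714]   # REMED, JRip, J48 (rounded, from the text)
print(round(iman_davenport_f(avg_ranks, n=7), 3))
```

Under these assumptions the computation gives approximately 4.492, which exceeds the critical value of 3.885 and agrees with the rejection of the null hypothesis.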

We then proceeded with the Nemenyi post-hoc test to identify significant differences between the best-performing classifier (REMED) and the worst (J48). Under this test, two classifiers perform significantly differently if their average ranks differ by at least the critical difference (CD), given by:

$${CD}={q}_{\alpha }\sqrt{\frac{k(k+1)}{6N}}$$

where the critical value qα, which is based on the studentised range statistic (Demšar 2006), was 2.343 at α = 0.05 with k = 3 classifiers. Hence, with N = 7 metrics, the corresponding CD was computed as 2.343\(\sqrt{\frac{3\,\cdot\, 4}{6\,\cdot\, 7}}\) = 1.252. As the difference between the average ranks of J48 and REMED (2.714 – 1.429 = 1.285) was greater than the CD (1.252), we conclude that the post-hoc test was powerful enough to detect a significant difference (p < 0.05) between these two classifiers. Furthermore, the remaining pairwise comparisons yielded two groups of algorithms: 1) J48 shows no significant difference from JRip, and 2) JRip shows no significant difference from REMED.
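The pairwise comparison can be reproduced with a short Python sketch. It takes N = 7, the number of metrics given for the Friedman test, together with the q value and average ranks quoted in the text. The function and variable names are illustrative assumptions.

```python
import math
from itertools import combinations

def critical_difference(q_alpha, k, n):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

avg_ranks = {"REMED": 1.429, "JRip": 1.857, "J48": 2.714}  # values quoted in the text
cd = critical_difference(q_alpha=2.343, k=len(avg_ranks), n=7)

# A pair differs significantly when its rank gap exceeds the CD.
significant = {
    (a, b): abs(avg_ranks[a] - avg_ranks[b]) > cd
    for a, b in combinations(avg_ranks, 2)
}
```

Here `cd` rounds to 1.252, and only the REMED versus J48 pair exceeds it, which matches the two overlapping groups described above.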

Finally, to assess whether REMED, taken as the control approach, performed significantly better than the other symbolic classifiers, we calculated the CD for the Bonferroni–Dunn test using the same equation as for the Nemenyi test, but with the critical value taken for α/(k − 1) (Demšar 2006). For q0.05 = 2.241, k = 3 classifiers and N = 7 metrics, the corresponding CD was 2.241\(\sqrt{\frac{3\,\cdot\, 4}{6\,\cdot\, 7}}\) = 1.198, which indicated that REMED performed significantly better than J48 (2.714 – 1.429 = 1.285 > 1.198). However, there were no significant differences with respect to JRip (1.857 – 1.429 = 0.428 < 1.198).
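The control-versus-others comparison can be sketched in the same way. The code below is a minimal illustration that uses the q value and average ranks quoted in the text and N = 7 metrics. The function name `bonferroni_dunn` is hypothetical and does not refer to any library routine.

```python
import math

def bonferroni_dunn(avg_ranks, control, q_alpha, n):
    """Compare every classifier against `control` using the Bonferroni-Dunn CD.

    Returns the CD and, for each other classifier, whether its average rank
    is worse than the control's by more than the CD.
    """
    k = len(avg_ranks)
    cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
    worse_than_control = {
        name: (rank - avg_ranks[control]) > cd
        for name, rank in avg_ranks.items()
        if name != control
    }
    return cd, worse_than_control

avg_ranks = {"REMED": 1.429, "JRip": 1.857, "J48": 2.714}  # values quoted in the text
cd_bd, worse = bonferroni_dunn(avg_ranks, control="REMED", q_alpha=2.241, n=7)
```

Because only k − 1 comparisons are made against the control, the critical value is smaller than in the Nemenyi test, so the CD shrinks from 1.252 to about 1.198. The conclusions for J48 and JRip stay the same.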

Nevertheless, REMED was shown to be an unbiased symbolic classifier, suggesting its potential to identify insolvent cases, which have the highest erroneous classification cost. Moreover, nonparametric tests provide a more suitable statistical comparison among classifiers because they do not assume normal distributions or homogeneity of variance and can be applied to any evaluation measure (Demšar 2006). Therefore, including performance ranks for accuracy and interpretability metrics broadens the scope of statistical testing to find the best compromise solution.

On the other hand, cost-sensitive (Elkan 2001) and resampling strategies (Chawla et al. 2002) require constructing a cost matrix and determining under/oversampling rates (before running the algorithm) to address the class imbalance problem. In this sense, the fact that REMED requires a single parameter (the confidence level for attribute selection) allows a more automated ML process.

Conclusions, Limitations and Future Work

ML for financial risk management has garnered substantial interest in the sector to improve the accuracy of forecasting banking failures and estimating creditworthiness.

However, the joint impact of the class imbalance problem and the dilemma of gaining accuracy at the cost of interpretability in ML approaches has not been widely studied, which constitutes a relevant research gap for bank failure prediction and credit scoring models.

This study aimed to assess the performance of REMED, a symbolic classifier, in the context of financial risk prediction, using imbalanced datasets and considering a trade-off between accuracy and interpretability. A comparative analysis was conducted against two well-known rule-generating approaches, J48 and JRip, using a dataset provided by the Federal Deposit Insurance Corporation.

The experimental results showed that REMED is a better-performing and more direct ML method for improving predictive accuracy on imbalanced financial data without affecting model interpretability. Furthermore, our study addresses a key research gap by examining how the class imbalance problem can affect interpretability performance, especially at extreme imbalance ratios.

A notable limitation of this study is that the performance comparison was based on a single dataset of banking crisis analysis indicators. However, the selected sample comprises a large amount of real data from 8292 banks, each represented by 17 independent financial attributes, which can reduce the effect of biased variance estimates caused by dependencies between examples drawn from a single dataset.

A possible avenue for future research is to combine REMED with an oversampling rate to increase the representation of the target class, or with cost ratio parameters to reduce misclassified examples instead of classification errors, and to expand the experimental framework to encompass a broader collection of datasets.