# Journal of Risk Model Validation

**ISSN:** 1753-9579 (print), 1753-9587 (online)

**Editor-in-chief:** Steve Satchell

# Performance validation of representative sample-balancing methods in loan credit-scoring scenarios

#### Need to know

- 12 of the most representative sample-balancing methods are validated quantitatively.
- The paper finds that all of the balancing methods tested can meet the stability requirements of the regulatory authorities.
- It is argued that the balancing ratio needs to be set properly to achieve better performance.
- Balancing with a larger number of related variables enhances the performance and robustness of most methods.

#### Abstract

Data sets used to construct credit-scoring models are almost always imbalanced in the real world, biasing the recognition ability of the model toward the majority, low-risk samples and away from the minority, high-risk samples. In the past few decades, many sample-balancing methods have been designed to balance the two classes of samples before modeling, but they lack sufficient performance verification, especially on large data sets. This paper quantitatively validates 12 of the most representative balancing methods. The results show that, in terms of performance, the method combining the synthetic minority oversampling technique (SMOTE) with the Edited Nearest Neighbor method is optimal, followed by the SMOTE-Tomek method, whose performance is significantly different from that of the other methods tested. All 12 balancing methods can maintain stability and thus meet the relevant requirements of the regulatory authorities. The performance of each credit-scoring model is also influenced by the balancing ratio and the number of variables in the data set. In general, the user needs to determine the proper balancing ratio according to the comprehensive characteristics of the scoring model and the balancing method, and a data set containing a larger number of related variables will improve the performance and robustness of most balancing methods.

## 1 Introduction

Credit risk refers to the possibility that a debtor receiving credit support cannot repay the principal and interest on time and in full according to the contract. In real-world practice, it is customary for modern commercial banks to use credit-scoring models to quantitatively evaluate the credit risk of potential loan applicants. This can effectively improve the objectivity, accuracy and consistency of the evaluation process while reducing the evaluation cost and the interference of human factors. After decades of research and practice, a large number of credit-scoring models have been developed in both industry and academia, from the logistic regression and naive Bayesian models based on classic statistical theory to the decision-tree, artificial neural network (ANN) and support vector machine (SVM) models based on machine learning tools, as well as, for example, the random forest, gradient-boosted decision trees (GBDT) and extreme gradient boosting (XGBoost) models created using ensemble modeling ideas (Sohn et al 2016; Harris 2015; Liu et al 2022).

As the studies on credit-scoring models have become more sophisticated, it has gradually been discovered that the characteristics of a data set have an important impact on the evaluation performance of most models, in particular the sample imbalance characteristics that most real credit data sets have (Xiao et al 2021). Sample imbalance characteristics refer to the distribution imbalance of different class samples in data sets. In the scenario of credit scoring, default events are sporadic and, compared with the total number of credit events, the number of default samples is always much smaller than the number of nondefault samples. If the initial data set is not processed before the model is fitted, the fitted model will have a good ability to recognize the majority class samples but a poor ability to recognize the minority class samples (Li et al 2021). A biased model will recognize some high-risk minority samples as low-risk majority samples, which will increase the potential risk of any loan business. Many studies have explored possible solutions to this problem, including applying weights to different class samples for performing cost-sensitive learning in the model (Hou et al 2020; Li et al 2021), enhancing the fitting depth on imbalanced samples (Wang et al 2022a; Mushava and Murray 2022) and preprocessing the data with sample-balancing methods (Guan et al 2021).

Sample-balancing methods use different strategies to adjust the distribution of the different classes of samples. Current methods can be split into three categories: undersampling methods, oversampling methods and ensemble methods.

Undersampling methods are intended to generate a subset of the majority-class samples. The RandomUnderSampler algorithm is the simplest undersampling method. It works by randomly picking samples of the majority class, with or without replacement. However, random picking may select noisy samples or samples with weak characteristics. Hence, most studies try to identify the “representative majority samples” from perspectives such as information value and classification effect. One popular approach is to make the boundaries between the classes farther apart and clearer by deleting the majority class samples in the most similar heterogeneous sample pairs, as in the Condensed Nearest Neighbor (Hart 1968) and Tomek Links (Tomek 1976) algorithms. A more popular approach is to focus on the classification rules based on the borderline samples, as is done by the One-Sided Selection, Neighborhood Cleaning Rule (Laurikkala 2001), Near Miss and Instance Hardness Threshold (Smith et al 2014) methods.

Oversampling methods, on the other hand, repeatedly sample the minority samples with or without replacement, or generate new minority samples from the existing ones by applying specific rules. The RandomOverSampler algorithm is the simplest oversampling method and was popular in practice for decades, until Chawla et al (2002) proposed the synthetic minority oversampling technique (SMOTE), which creatively put forward the idea of generating new minority samples. A variety of derivative and variant models were subsequently designed, such as the adaptive synthetic (ADASYN) sampling (He et al 2008), borderline-SMOTE (Nguyen et al 2011), $K$-means-SMOTE (Douzas et al 2018), SVM-SMOTE (Sitompul and Nababan 2018) and linear regression-SMOTE (LR-SMOTE) (Liang et al 2020) algorithms. Most of these methods differ in how they select the initial samples, detect borderline samples and define the rules for generating new minority samples.

Since each method has its own advantages and drawbacks, recent works focus on the ensemble methods, which contain more steps and combine the features of both under- and oversampling methods. For example, Sain and Purnami (2015) use SMOTE to generate new minority samples, then add a Tomek Links part to discard noisy samples and build the SMOTE-Tomek model. Guan et al (2021) propose the SMOTE-ENN method, which uses a weighted Edited Nearest Neighbor (ENN) rule to detect and delete noisy majority and minority class samples after the SMOTE method has been applied. Shi et al (2022) propose a resampling algorithm based on sample concatenation (Re-SC), which transforms an imbalanced training data set in the original sample space into a concatenated data set in a new sample space to reduce the overlapping region. In addition, some papers focus on the “borderline samples” (near samples of different classes) and find some better ways to detect them (see, for example, Wang et al 2022b; Jo and Kim 2022).

Although the above methods are becoming popular in academic research and industry practice, there is still a lack of sufficient prior knowledge about which method should be prioritized, since there is limited literature on the empirical comparison of the performance of representative sample-balancing methods and on the in-depth exploration of the robustness and effectiveness of each method under different settings. According to our literature search, similar recent studies include the following: Duman et al (2012) verified the performance of different classifiers on imbalanced data sets but did not verify the sample-balancing methods; Haixiang et al (2017) discussed and compared many methods but did not apply quantitative methods for further inspection; Bi and Zhang (2018) checked several advanced sample-balancing methods with four indicators, but mainly on small data sets. Unlike their work, we focus on more popular methods that are easy to apply, on a larger data set and on more indicators. Tarekegn et al (2021) reviewed several methods for imbalanced multi-label classification problems, whereas we focus only on two-class credit risk classification problems. In addition, most popular sample-balancing methods were designed on small data sets. However, loan business practice has accumulated a large number of samples in recent years, and the applicability of the relevant methods to large data sets needs to be reassessed.

Based on the above review and analysis, it is natural to ask the following questions.

- •
Of the existing representative sample-balancing methods, which are the most superior in general?

- •
Which methods are significantly different from the others?

- •
Which methods are stable enough to meet the requirements of the regulatory authorities?

- •
Which methods are robust when the balancing ratio and size of the data set are changed?

To answer these questions, we select 12 of the most representative sample-balancing methods, with the aim of providing strong evidence for their performance, differences, stability and robustness via the use of four credit-scoring models and multiple indicators, and we obtain some meaningful conclusions. This study makes the following contributions to the literature.

- (1)
It briefly summarizes 12 representative sample-balancing methods.

- (2)
Through empirical research, it identifies the optimal sample-balancing method; to the best of the authors’ knowledge, there is little comparable research in the extant literature.

- (3)
The differences, stability and robustness of each method are explored thoroughly using empirical data.

- (4)
Combined with our research findings, some valuable practical suggestions are put forward for users.

The remainder of this paper is organized as follows. Section 2 describes in detail several targeted sample-balancing methods. In Section 3, the experimental settings, results and the corresponding analysis are presented. Finally, Section 4 states our conclusions and proposes some suggestions for users.

## 2 Methods

This section focuses on the sample-balancing methods to be validated in this paper; these are representative and widely applied in most of the existing literature.

### 2.1 RandomUnderSampler

This method simply picks samples from the majority class at random, with or without replacement, until the number of majority-class samples picked is close to the number of minority samples.

RandomUnderSampler is widely used in practice, but the randomness of its sampling also makes the information value of the balanced data set fluctuate randomly.
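As an illustration, the random undersampling step can be sketched with plain NumPy (a minimal sketch with our own function name, not the imblearn implementation used in the experiments later in this paper):

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Randomly pick majority samples (without replacement) until the
    two classes are the same size; minority samples are kept as-is."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

# toy imbalanced data: 8 majority (label 0), 2 minority (label 1)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = random_undersample(X, y)
print(np.bincount(yb))  # counts per class after balancing: [2 2]
```

Fixing the seed makes a run reproducible; in practice, the random draw is exactly what makes the information value of the balanced data set fluctuate between runs.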

### 2.2 Tomek Links

A “Tomek link” between two samples $x$ and $y$ of different classes is defined as follows: for every sample $z$,

$d(x,y)<d(x,z),$ | (2.1a) |

$d(x,y)<d(y,z),$ | (2.1b) |

where $d(\cdot )$ is the function that measures the distance between two samples; in other words, $x$ and $y$ belong to different classes and are each other’s nearest neighbors in the sample space. The most widely used distance is the Euclidean distance, defined as

$d(x,y)={\left({\displaystyle \sum _{i=1}^{n}}{({x}^{i}-{y}^{i})}^{2}\right)}^{1/2},$ | (2.2) |

where $x$ and $y$ are samples of dimension $n$ and $i$ indexes the dimensions. Moreover, all the methods mentioned below use the Euclidean distance to measure the similarity between samples.

After finding such a link, the majority class sample in the link will be eliminated. This process will be repeated until all the majority class samples in a link are eliminated.

The Tomek Links method specifically removes majority samples close to the boundary of the two classes of samples, so as to make the borderline clearer. However, this method needs to calculate the distance between different sample pairs and make a global comparison, which makes it computationally expensive. In addition, users cannot set the targeted balancing ratio of the resulting data set, because the Tomek Links method will eliminate all the qualified majority class samples.
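The link-detection step can be sketched as follows (a brute-force NumPy illustration under our own naming; the full pairwise distance matrix makes it suitable only for small data sets, which is precisely the computational cost noted above):

```python
import numpy as np

def tomek_majority_indices(X, y, minority_label=1):
    """Indices of majority samples in a Tomek link: the sample and its
    nearest neighbour are mutual nearest neighbours of different classes."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)  # nearest neighbour of every sample
    return np.array([i for i in range(len(y))
                     if y[i] != minority_label
                     and y[nn[i]] == minority_label
                     and nn[nn[i]] == i])

# one majority point (index 0) sits right next to a minority point
X = np.array([[0.0], [0.1], [5.0], [10.0]])
y = np.array([0, 1, 0, 0])
print(tomek_majority_indices(X, y))  # [0]
```

Eliminating the returned indices (and repeating until no links remain) yields the cleaned majority set.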

### 2.3 Near Miss

Near Miss is a representative method that has been continuously refined and now has three versions. Version 1 selects the majority samples for which the average distance to the $n$ closest minority samples ${y}_{1},{y}_{2},\mathrm{\dots},{y}_{n}$ is the smallest:

${\displaystyle \sum _{i=1}^{n}}d(x,{y}_{i})<{\displaystyle \sum _{i=1}^{n}}d({x}^{\prime},{y}_{i}),$ | (2.3) |

where $x$ is the selected majority sample and ${x}^{\prime}$ is a random majority sample that is not selected.

Conversely, version 2 selects the majority samples for which the average distance to the $n$ farthest minority samples ${y}_{1}^{\mathrm{f}},{y}_{2}^{\mathrm{f}},\mathrm{\dots},{y}_{n}^{\mathrm{f}}$ is the smallest:

${\displaystyle \sum _{i=1}^{n}}d(x,{y}_{i}^{\mathrm{f}})<{\displaystyle \sum _{i=1}^{n}}d({x}^{\prime},{y}_{i}^{\mathrm{f}}),$ | (2.4) |

where the meaning of $x$ and ${x}^{\prime}$ is the same as that in (2.3).

Version 3 has two steps. The first is to find and keep the $k$ nearest samples for each minority sample $z$ as ${y}_{1},{y}_{2},\mathrm{\dots},{y}_{k}$ using the $k$-nearest neighbors ($k$-NN) algorithm. The second is to select $n$ majority samples for which the average distances to ${y}_{1},{y}_{2},\mathrm{\dots},{y}_{k}$ are the largest:

${\displaystyle \sum _{i=1}^{k}}d(x,{y}_{i})>{\displaystyle \sum _{i=1}^{k}}d({x}^{\prime},{y}_{i}),$ | (2.5) |

where $x$ is a selected majority sample, ${x}^{\prime}$ is a random majority sample that is not selected, and $d(\cdot )$ is the function that measures the distance between two samples. Since version 3 is the most advanced, it is applied in our research. Its core idea is to find the majority samples that are most different from the adjacent samples of the minority samples. Unlike the Tomek Links method, this method does not need to compare the distances between all sample pairs, which improves the computational efficiency. Users can also set the balancing ratio of the resulting data set.

### 2.4 One-Sided Selection

This method first uses Tomek Links (as described in Section 2.2) to remove noisy majority class samples, then applies a $k$-NN classifier with $k=1$ to all the remaining samples, yielding a prediction result for the class of each sample. The principle of the $k$-NN method is that, for a particular sample $x$ in the sample space, the $k$ samples nearest to sample $x$ are selected, their class labels are observed and the result with the highest count by voting is selected as the predicted class label of $x$. Finally, all misclassified samples will remain, regardless of whether they belong to the majority class or the minority class.

Compared with the Tomek Links method, One-Sided Selection additionally retains the majority class samples that are misclassified by the $k$-NN classifier, boosting the information value of the retained samples. At the same time, its disadvantages are that the small $k$ value makes the elimination of majority class samples more susceptible to noisy information, and the user still cannot control the balancing ratio of the final data set.
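The $k$-NN voting rule described above can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the class of x by majority vote among the k training
    samples nearest to x (the voting rule described above)."""
    d = np.linalg.norm(X_train - x, axis=1)
    votes = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[counts.argmax()]

X_train = np.array([[0.0], [1.0], [10.0]])
y_train = np.array([0, 0, 1])
print(knn_predict(X_train, y_train, np.array([9.0]), k=1))  # 1
print(knn_predict(X_train, y_train, np.array([9.0]), k=3))  # 0
```

With $k=1$, as used by One-Sided Selection, a single noisy neighbour can flip the prediction, which is the sensitivity to noise noted above.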

### 2.5 Neighborhood Cleaning Rule

The Neighborhood Cleaning Rule method focuses on cleaning the data and then condensing them. It mainly uses the union of samples rejected by the Edited Nearest Neighbor (E-NN) rule and the $k$-NN method.

Specifically, it first employs the E-NN rule to identify the noisy samples among the majority class samples (those whose class differs from the majority class of their three nearest neighbors) and to add them to the collection ${S}_{n}$. Then a $k$-NN method is applied: the three majority nearest neighbors that misclassify each minority sample are inserted into the collection ${S}_{m}$. Finally, if the initial data set is $T$, the data set ${S}_{r}$ selected after processing by the Neighborhood Cleaning Rule method is given by

${S}_{r}=T-({S}_{n}\cup {S}_{m}).$ | (2.6) |

Like the One-Sided Selection method, the Neighborhood Cleaning Rule method incorporates a data-cleaning part in the calculation process for sample balancing, which is helpful for improving data quality. Its weakness is that the setting of the intermediate process makes it impossible for the user to specify the balancing ratio of the final data set.

### 2.6 RandomOverSampler

This method simply picks samples from the minority class at random, repeatedly and with replacement, until the number of minority-class samples picked is close to the number of majority samples.

RandomOverSampler is widely used in practice, but the randomness of its sampling also makes the information value of the balanced data set fluctuate randomly.

### 2.7 SMOTE

SMOTE is an innovative method that aims to generate new minority samples to solve the sample imbalance problem. For each minority sample ${x}_{i}$ in the subset of all the current minority samples, a $k$-NN algorithm is used to find the $k$ most similar samples of the same class and randomly select one of the $k$ most similar samples, labeled ${x}_{j}$. A new minority sample ${x}_{\mathrm{new}}$ will then be generated as

${x}_{\mathrm{new}}={x}_{i}+\mathrm{gap}({x}_{j}-{x}_{i}),$ | (2.7) |

where $\mathrm{gap}$ is a random number in $[0,1]$.

After repeating the above generating process, the new minority samples generated will finally meet the preset balancing ratio requirements.
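The generating rule (2.7) can be sketched directly (a minimal NumPy illustration under our own naming, not a full SMOTE implementation):

```python
import numpy as np

def smote_generate(X_min, k=2, n_new=4, seed=0):
    """Synthesize minority samples via (2.7):
    x_new = x_i + gap * (x_j - x_i), gap ~ U[0, 1],
    with x_j one of the k nearest minority neighbours of x_i."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])  # skip x_i itself
        new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_generate(X_min)
# every synthetic point lies on a segment between two minority samples
```

Because each new point is a convex combination of two minority samples, the synthetic set never leaves the region spanned by the existing minority samples.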

The advantage of SMOTE is that it is easy to implement, completely retains the majority class samples and makes the sample space more complete for classification by generating minority class samples. However, since it treats all minority samples equally and does not consider the class information of adjacent samples, samples of different classes commonly overlap in the same region of the sample space (so-called sample aliasing), and the randomness of the generation process may produce samples that provide no useful information, leading to overfitting of credit-scoring models.

### 2.8 Variants of SMOTE

Variants of the SMOTE method have been designed to improve performance. Borderline-SMOTE is a representative variant method. It first uses the $k$-NN method to detect borderline minority samples (which are minority samples close to majority samples), and then it generates new minority samples according to

${x}_{\mathrm{new}}={x}_{i}^{\prime}+\mathrm{gap}({x}_{j}^{\prime}-{x}_{i}^{\prime}),$ | (2.8) |

where $\mathrm{gap}$ is a random number in $[0,1]$, and ${x}_{i}^{\prime}$ and ${x}_{j}^{\prime}$ are borderline minority samples instead of whole minority samples.

The most useful feature of the borderline-SMOTE method is that, unlike the SMOTE method, it learns the boundary of different class samples as accurately as possible and derives new minority samples close to the boundary, making it easier for the scoring model to identify the classification boundary.

Another representative variant method is SVM-SMOTE. Unlike in the borderline-SMOTE method, $k$-NN is replaced by an SVM in order to improve the ability of the method to detect and recognize the borderline samples, while the rest of the process is similar to that of borderline-SMOTE. For a binary classification problem, the SVM aims to find the hyperplane in the feature space that maximizes the margin between the two classes of samples, that is:

$\mathrm{max}\gamma \mathit{\hspace{1em}}\text{such that}\mathit{\hspace{0.5em}}{y}_{i}\left({\displaystyle \frac{{w}^{\mathrm{T}}}{|w|}}{x}_{i}+{\displaystyle \frac{b}{|w|}}\right)\ge \gamma ,\mathit{\hspace{1em}}i=1,2,\mathrm{\dots},n,$ | (2.9) |

where ${x}_{i}$ represents sample $i$, ${y}_{i}\in \{-1,1\}$ is its class label, $w$ is the weight vector, $b$ is the intercept, and the optimal solution is obtained via Lagrangian duality. The SVM-SMOTE method can find more accurate borderlines than the borderline-SMOTE method, since the classification ability of the SVM is generally better than that of $k$-NN, but it is also subject to the problem of overfitting.

SMOTE-Tomek is another popular variant method, which combines the idea of new minority sample generation and clarity of borderlines. It first uses the SMOTE method to generate new minority samples according to (2.7), then employs the Tomek Links method mentioned in Section 2.2 to further refine the data set by eliminating samples in the Tomek links.

In general, the Tomek Links step can alleviate the problem that the new samples generated by the SMOTE method contain some noisy data, but the downsides are that the computational complexity rises further and the Tomek Links method cannot ensure that the resulting data set exactly meets the balancing ratio preset by the user.

SMOTE-ENN is a very similar method to SMOTE-Tomek. After generating new minority samples by using the SMOTE method according to (2.7), it employs the E-NN method to analyze whether the majority of the class labels of several nearest neighbor samples for each sample ${x}_{i}$ is the same as the class label of ${x}_{i}$; if not, ${x}_{i}$ will be removed.
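The E-NN editing rule just described can be sketched as follows (a small NumPy illustration with hypothetical naming):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """E-NN editing: keep sample i only if the majority label among its
    k nearest neighbours agrees with y[i]; otherwise remove it."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    keep = [i for i in range(len(y))
            if np.bincount(y[np.argsort(D[i])[:k]]).argmax() == y[i]]
    return np.array(keep)

# a lone minority sample (index 3) inside a majority cluster is removed
X = np.array([[0.0], [0.1], [0.2], [0.15], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1, 1])
print(enn_filter(X, y))
```

In SMOTE-ENN this filter is applied to both classes after the SMOTE generation step, so noisy synthetic minority samples can also be discarded.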

Compared with the SMOTE-Tomek method, SMOTE-ENN has a smaller computational cost and makes fewer errors in eliminating noisy samples. It has been demonstrated that, as the sample size grows, the classification error of E-NN is never more than twice the Bayes classification error (Cover and Hart 1967), ie,

$\underset{N\to \mathrm{\infty}}{\mathrm{lim}}{\mathrm{pe}}_{\text{E-NN}}\le 2{\mathrm{pe}}_{\mathrm{Bayes}},$ | (2.10) |

where ${\mathrm{pe}}_{\text{E-NN}}$ is the prediction error of E-NN, ${\mathrm{pe}}_{\mathrm{Bayes}}$ is the prediction error of the Bayes method and $N$ is the sample size.

### 2.9 Adaptive synthetic sampling

The computational process of ADASYN can be divided into four steps.

The first step is to calculate the unevenness of the two classes of samples, given by

$d={\displaystyle \frac{{n}_{\mathrm{minority}}}{{n}_{\mathrm{majority}}}},$ | (2.11) |

where ${n}_{\mathrm{minority}}$ and ${n}_{\mathrm{majority}}$ are the number of minority samples and majority samples, respectively, and $d\in (0,1]$.

The second step is to calculate the number of samples that need to be synthesized as new minority samples, according to

$G=\alpha ({n}_{\mathrm{majority}}-{n}_{\mathrm{minority}}),$ | (2.12) |

where $\alpha $ is the parameter that controls the balancing ratio of the data set after generating the synthesized samples.

In the third step, for each sample ${x}_{i}$ in the minority class, the $k$-NN method is applied to find its $k$ nearest neighbors and to calculate the ratio

${r}_{i}={\displaystyle \frac{{\mathrm{\Delta}}_{i}}{k}},$ | (2.13) |

where ${\mathrm{\Delta}}_{i}$ is the number of majority class samples among the $k$ nearest neighbors of ${x}_{i}$. Then the ${r}_{i}$ value of each minority sample is standardized as

${r}_{i}={\displaystyle \frac{{r}_{i}}{{\sum}_{i=1}^{{n}_{\mathrm{minority}}}{r}_{i}}},$ | (2.14) |

so that the ${r}_{i}$ values sum to unity and can be used as weights in the next step.

In the final step, the number ${g}_{i}$ of synthetic samples to be generated for each minority sample, given by

${g}_{i}={r}_{i}G,$ | (2.15) |

is calculated. Finally (2.7) is used to generate new minority samples.

ADASYN uses a mechanism to automatically determine how many synthetic samples each minority sample needs to produce, rather than synthesizing the same number of samples for each minority sample as in the SMOTE method and its variants. The disadvantage of ADASYN is that it is easily affected by outliers; if the $k$-nearest neighbor of a minority sample is a majority sample, its weight will become quite large, and more samples will be generated around it.
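Steps (2.12)–(2.15) can be sketched as follows (a minimal NumPy illustration under our own naming; it returns the raw, unrounded allocation ${g}_{i}$):

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3, alpha=1.0):
    """How many synthetic samples g_i to generate around each minority
    sample, weighted by its share of majority neighbours r_i."""
    min_idx = np.flatnonzero(y == minority_label)
    n_min, n_maj = len(min_idx), len(y) - len(min_idx)
    G = alpha * (n_maj - n_min)                      # (2.12)
    r = np.empty(len(min_idx))
    for t, i in enumerate(min_idx):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]                  # k nearest neighbours
        r[t] = np.sum(y[nn] != minority_label) / k   # (2.13)
    r = r / r.sum()                                  # (2.14)
    return r * G                                     # (2.15)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [2.4], [3.6]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
g = adasyn_allocation(X, y)
print(g, g.sum())  # the g_i sum to G = 6 - 2 = 4
```

A minority sample surrounded mostly by majority samples receives a large ${r}_{i}$ and therefore a large ${g}_{i}$, which is exactly the outlier sensitivity discussed above.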

## 3 Experiments

This section principally compares the performance of each sample-balancing method and discusses their performance differences, stability and robustness.

### 3.1 Data set and description

In view of the increasing size of the historical credit data sets available to modern commercial banks conducting credit-scoring activities, this study selects the credit card data set from the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) as a large-sample credit data set for analysis. The features of the data set are as follows: it has a representative balancing ratio of 3.521, with 23 364 majority samples and 6636 minority samples. It also has 23 variables (not including the dependent variable), and the dependent variable is a binary dummy variable, which takes a value of $1$ for default (minority) and $0$ for nondefault (majority).

### 3.2 Methods and models

There are plenty of sample-balancing methods to choose from in the current literature, including both original and derivative methods. Our selection criteria are as follows:

- (1)
the method must have representative design ideas;

- (2)
it must be used widely in industry practice and academic papers;

- (3)
it must be easy to implement and have an appropriate computational time.

Guided by the reviews by Haixiang et al (2017) and Shi et al (2022), we ultimately selected 12 sample-balancing methods: RandomUnderSampler, Tomek Links, Near Miss, One-Sided Selection, Neighborhood Cleaning Rule, RandomOverSampler, SMOTE, borderline-SMOTE, SVM-SMOTE, SMOTE-Tomek, SMOTE-ENN and ADASYN. The details of the 12 methods are presented in Section 2.

In addition to the sample-balancing methods, the choice of credit-scoring model is a key factor in the evaluation performance. We select several representative models in order to fully check the performance of different sample-balancing methods. Our selection criteria are as follows:

- (1)
the model is used widely in industry practice and the literature;

- (2)
it has a reasonable computational complexity;

- (3)
the design principles of all the models vary greatly.

According to these criteria, we choose the logistic regression (logistic) model as representative of statistical models, the SVM model and back-propagation neural network (BP-ANN) model as representative of machine learning models and the extreme gradient boosting (XGBoost) model as representative of ensemble models.

### 3.3 Experimental settings and evaluation indicators

We use fivefold cross-validation to comprehensively evaluate the average performance of each sample-balancing method on different credit-scoring models. In the training period for each model we first apply all the sample-balancing methods for comparison one by one to balance the training samples, then set the distribution proportion of the two classes of samples to be roughly similar and finally build separate models based on these balanced data sets. In the testing period of each model we retain the original imbalanced data set as the benchmark to verify the generalization performance of each trained model.
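A minimal sketch of this protocol, using plain NumPy and simple random oversampling as a stand-in for whichever balancing method is under test (all function names are ours): the key point is that only the training folds are balanced, while the test fold keeps its original imbalanced distribution.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def oversample_train(X, y, minority_label=1, seed=0):
    """Balance ONLY the training fold by random oversampling."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx))
    idx = np.concatenate([maj_idx, min_idx, extra])
    return X[idx], y[idx]

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)
folds = kfold_indices(len(y), k=5)
for f in range(5):
    test_idx = folds[f]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != f])
    X_tr, y_tr = oversample_train(X[train_idx], y[train_idx])
    # ...fit the scoring model on (X_tr, y_tr), evaluate on the raw fold
```

Balancing before splitting would leak duplicated (or synthesized) minority samples into the test fold and inflate the reported performance.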

On a common imbalanced data set, if the credit-scoring model predicts all the samples as the majority class, evaluation indicators such as accuracy and precision usually achieve high values, which are not conducive to the objective evaluation of the balance recognition ability of the scoring models and balancing methods. Thus, the evaluation indicators in the testing period reported in this study are defined as follows.

##### AUC.

The area under the receiver operating characteristic (ROC) curve measures a model’s ability to distinguish between samples of different classes, which makes it well suited to imbalanced data sets. The range of $\mathrm{AUC}$ values is $[0,1]$, and larger values indicate better classification ability. In order to calculate the $\mathrm{AUC}$, it is first necessary to draw the ROC curve: the horizontal axis is the false positive rate,

$\mathrm{FPR}={\displaystyle \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}},$ | (3.1) |

and the vertical axis is the true positive rate,

$\mathrm{TPR}={\displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}},$ | (3.2) |

where $\mathrm{FP}$ denotes the number of majority samples wrongly classified as minority samples, $\mathrm{TN}$ denotes the number of correctly classified majority samples, $\mathrm{TP}$ denotes the number of correctly classified minority samples, and $\mathrm{FN}$ denotes the number of minority samples wrongly classified as majority samples. The $\mathrm{AUC}$ is the area enclosed by the ROC curve and the two coordinate axes. The closer the ROC curve is to the upper-left corner of the plot, the larger the value of $\mathrm{AUC}$ (ie, the better the classification ability).
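Equivalently, the $\mathrm{AUC}$ can be computed from its rank interpretation without drawing the curve (a minimal sketch; the function name is ours):

```python
def auc_score(y_true, scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen minority (positive) sample is scored above a randomly chosen
    majority (negative) sample, with ties counted as half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

This pairwise form makes clear why the $\mathrm{AUC}$ is insensitive to the class ratio: it compares every minority sample only against majority samples, never against the overall base rate.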

##### Sensitivity.

This is also called the true positive rate, $\mathrm{TPR}$. It measures the model’s recognition of minority samples and is defined as in (3.2).

##### $G$-mean.

This measures the relative balance of the model’s performance on the majority and minority samples. The closer this value is to unity, the more balanced the classification performance across the two classes. It is calculated as follows:

$G\text{-mean}=\sqrt{\mathrm{Sensitivity}\times \mathrm{Specificity}},$ | (3.3) |

where $\mathrm{Specificity}$ is calculated as

$\mathrm{Specificity}={\displaystyle \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}}.$ | (3.4) |

##### $F$-score.

This is the harmonic mean of the precision and recall. A score of $1$ denotes perfect precision and recall, while a score of $0$ is the worst. The $F$-score, $\mathrm{Precision}$ and $\mathrm{Recall}$ variables are defined as follows:

$F\text{-score}={\displaystyle \frac{2\times \mathrm{Recall}\times \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}},$ | (3.5) |

$\mathrm{Recall}={\displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}},$ | (3.6) |

$\mathrm{Precision}={\displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}}.$ | (3.7) |
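The four indicators other than the $\mathrm{AUC}$ can be computed directly from the confusion counts (a minimal sketch under our own naming):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, G-mean and F-score from the confusion
    counts defined in (3.1)-(3.7)."""
    sensitivity = tp / (tp + fn)                   # TPR / Recall, (3.2)
    specificity = tn / (tn + fp)                   # (3.4)
    precision = tp / (tp + fp)                     # (3.7)
    g_mean = math.sqrt(sensitivity * specificity)  # (3.3)
    f_score = 2 * sensitivity * precision / (sensitivity + precision)  # (3.5)
    return sensitivity, specificity, g_mean, f_score

# eg, 8 of 10 minority and 16 of 20 majority samples classified correctly
print(imbalance_metrics(tp=8, fn=2, fp=4, tn=16))
```

Unlike plain accuracy, none of these indicators can be inflated simply by predicting every sample as the majority class, which is why they are preferred here.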

All the experiments were run in the Python 3.8 environment on a 3.60 GHz Intel Core i7-4790 eight-core processor with 8 GB RAM. The four models (logistic, SVM, BP-ANN and XGBoost) were constructed using the sklearn package in Python, and their parameters were also optimized by using a grid search method. All the sample-balancing methods adopted the built-in functions in the imblearn package to ensure the standardization and validity of the program.

### 3.4 Experimental results and analysis

#### 3.4.1 Performance evaluation results

For each credit-scoring model trained on the data sets balanced by the different balancing methods, we first plot the ROC curves of their predicted results in the testing period. The results are shown in Figure 1. In each graph, the horizontal axis shows the $\mathrm{FPR}$ and the vertical axis shows the $\mathrm{TPR}$. The closer the ROC curve is to the upper-left corner, the better the classification performance. Meanwhile, since the SVM model gives classification results directly rather than probability results, its ROC curve is less smooth than those of the other models.

It can be seen from Figure 1 that, of the four credit-scoring models, the position of ROC curve for the SMOTE-ENN method is always the highest, while the curves of One-Sided Selection and RandomOverSampler are much lower. Specifically, for the logistic model, the ROC curves of each balancing method are relatively close together, and their turning point is approximately the same. On the SVM model, it is clear that the turning points of the ROC curve for the SMOTE-ENN and SMOTE-Tomek methods are negative, while the turning points of the ROC curves of RandomOverSampler and Tomek Links are positive. On the graph for the BP-ANN model, it can be clearly observed that the ROC curve for the SMOTE-ENN method is higher than those of the other methods, and the curve of One-Sided Selection is at the lowest position. On the XGBoost model, the ROC curve of each method covers most areas of the coordinate axis space, and the SMOTE-ENN method is highest, while the One-Sided Selection, RandomOverSampler and other methods have higher valued turning points when their $\mathrm{TPR}$ values are lower.

Since the ROC curves of most methods are very similar, Table 1 compares the performance of all the models tested for each balancing method more quantitatively: it reports the average performance in the testing period, over fivefold cross-validation, of the four models trained on data sets processed by the different sample-balancing methods. For every evaluation indicator of each model, the optimal value is denoted by bold type.

From Table 1 we can observe that, for the credit card data set, when four representative credit-scoring models are used in turn, the SMOTE-ENN method is superior to the other 11 sample-balancing methods for four evaluation indicators ($\mathrm{AUC}$, $\mathrm{Sensitivity}$, $G$-mean and $F$-score), which indicates that this method can best facilitate different types of credit-scoring models to recognize the granular classifiable patterns in the balanced data set.
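The three indicators beyond $\mathrm{AUC}$ can be read directly off a confusion matrix; a minimal sketch assuming the standard definitions (defaults form the positive, minority class), with made-up counts:

```python
import math

def indicators(tp, fn, fp, tn):
    """Sensitivity, G-mean and F-score from confusion-matrix counts
    (standard definitions; the minority/default class is positive)."""
    sensitivity = tp / (tp + fn)            # recall on the minority class
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    g_mean  = math.sqrt(sensitivity * specificity)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, g_mean, f_score

# Hypothetical counts: 100 defaulters, 900 non-defaulters in the test fold.
sens, g, f = indicators(tp=60, fn=40, fp=20, tn=880)
print(round(sens, 3), round(g, 3), round(f, 3))
```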

For the other balancing methods, in terms of overall recognition ability, training on the data set balanced by the SMOTE-Tomek method gives all four credit-scoring models a higher $\mathrm{AUC}$, $G$-mean and $F$-score than training on the data sets balanced by most of the other methods. Although the relative performance of most of the compared sample-balancing methods varies across models, the differences in performance are not large.

In terms of the ability to recognize minority samples, the SMOTE-Tomek method achieved the second-highest $\mathrm{Sensitivity}$ (after SMOTE-ENN) for the four credit-scoring models. The One-Sided Selection and ADASYN methods can also support most models in achieving high $\mathrm{Sensitivity}$ values. In contrast, the RandomUnderSampler, RandomOverSampler and SVM-SMOTE methods make a weak contribution to the logistic model, with $\mathrm{Sensitivity}$ values of about $0.5$. When using the SVM model, training on the data set balanced by borderline-SMOTE or SVM-SMOTE leads to the weakest recognition ability for minority samples. When using the BP-ANN model, balancing the data set with the Neighborhood Cleaning Rule or ADASYN makes the $\mathrm{Sensitivity}$ in the testing period relatively low. The ability of the XGBoost model to recognize minority samples when the Neighborhood Cleaning Rule, Near Miss or RandomOverSampler methods are applied is not as good as with the other balancing methods.

Since most sample-balancing methods behave differently on different credit-scoring models, to sharpen the analysis we further compute the average ranking of each method over all the indicators, in order to assess their average relative performance.

**Table 1:** Average performance in the testing period of the four credit-scoring models under each balancing method (optimal values in bold).

(a) Logistic model

| Balancing method | AUC | Sensitivity | $G$-mean | $F$-score |
|---|---|---|---|---|
| RandomUnderSampler | 0.7248 | 0.5014 | 0.6615 | 0.6157 |
| One-Sided Selection | 0.7292 | 0.5262 | 0.6713 | 0.6306 |
| Neighborhood Cleaning Rule | 0.7364 | 0.5031 | 0.6703 | 0.6250 |
| Near Miss | 0.7164 | 0.4908 | 0.6584 | 0.6106 |
| Tomek Links | 0.7365 | 0.5101 | 0.6719 | 0.6279 |
| RandomOverSampler | 0.7248 | 0.5022 | 0.6616 | 0.6160 |
| SMOTE | 0.7251 | 0.5018 | 0.6613 | 0.6156 |
| Borderline-SMOTE | 0.7145 | 0.4885 | 0.6501 | 0.6019 |
| SVM-SMOTE | 0.7211 | 0.5142 | 0.6611 | 0.6180 |
| SMOTE-Tomek | 0.7623 | 0.5512 | 0.7027 | 0.6659 |
| SMOTE-ENN | **0.7725** | **0.6207** | **0.7226** | **0.7074** |
| ADASYN | 0.7335 | 0.5115 | 0.6705 | 0.6267 |

(b) SVM

| Balancing method | AUC | Sensitivity | $G$-mean | $F$-score |
|---|---|---|---|---|
| RandomUnderSampler | 0.6918 | 0.5221 | 0.6707 | 0.6288 |
| One-Sided Selection | 0.6962 | 0.5561 | 0.6807 | 0.6462 |
| Neighborhood Cleaning Rule | 0.7007 | 0.5309 | 0.6798 | 0.6395 |
| Near Miss | 0.6885 | 0.5323 | 0.6695 | 0.6303 |
| Tomek Links | 0.7012 | 0.5366 | 0.6813 | 0.6421 |
| RandomOverSampler | 0.6904 | 0.5283 | 0.6710 | 0.6304 |
| SMOTE | 0.6896 | 0.5277 | 0.6701 | 0.6295 |
| Borderline-SMOTE | 0.6811 | 0.5142 | 0.6602 | 0.6171 |
| SVM-SMOTE | 0.6853 | 0.5171 | 0.6643 | 0.6216 |
| SMOTE-Tomek | 0.7276 | 0.5781 | 0.7117 | 0.6795 |
| SMOTE-ENN | **0.7386** | **0.6287** | **0.7302** | **0.7159** |
| ADASYN | 0.6993 | 0.5380 | 0.6803 | 0.6413 |

(c) BP-ANN

| Balancing method | AUC | Sensitivity | $G$-mean | $F$-score |
|---|---|---|---|---|
| RandomUnderSampler | 0.7776 | 0.6076 | 0.7054 | 0.6794 |
| One-Sided Selection | 0.7668 | 0.5796 | 0.6944 | 0.6634 |
| Neighborhood Cleaning Rule | 0.7673 | 0.5667 | 0.6965 | 0.6627 |
| Near Miss | 0.7761 | 0.6106 | 0.7045 | 0.6793 |
| Tomek Links | 0.7679 | 0.5739 | 0.6983 | 0.6657 |
| RandomOverSampler | 0.7742 | 0.5975 | 0.7012 | 0.6734 |
| SMOTE | 0.7754 | 0.6128 | 0.7028 | 0.6784 |
| Borderline-SMOTE | 0.7637 | 0.5769 | 0.6896 | 0.6584 |
| SVM-SMOTE | 0.7750 | 0.5967 | 0.6994 | 0.6716 |
| SMOTE-Tomek | 0.8142 | 0.6383 | 0.7386 | 0.7157 |
| SMOTE-ENN | **0.8348** | **0.7198** | **0.7711** | **0.7644** |
| ADASYN | 0.7603 | 0.5579 | 0.6910 | 0.6556 |

(d) XGBoost

| Balancing method | AUC | Sensitivity | $G$-mean | $F$-score |
|---|---|---|---|---|
| RandomUnderSampler | 0.9971 | 0.9805 | 0.9762 | 0.9763 |
| One-Sided Selection | 0.9974 | 0.9803 | 0.9773 | 0.9774 |
| Neighborhood Cleaning Rule | 0.9970 | 0.9795 | 0.9749 | 0.9751 |
| Near Miss | 0.9964 | 0.9772 | 0.9727 | 0.9728 |
| Tomek Links | 0.9972 | 0.9821 | 0.9766 | 0.9767 |
| RandomOverSampler | 0.9969 | 0.9795 | 0.9749 | 0.9751 |
| SMOTE | 0.9973 | 0.9826 | 0.9763 | 0.9765 |
| Borderline-SMOTE | 0.9971 | 0.9823 | 0.9759 | 0.9761 |
| SVM-SMOTE | 0.9975 | 0.9834 | 0.9786 | 0.9787 |
| SMOTE-Tomek | 0.9992 | 0.9909 | 0.9889 | 0.9889 |
| SMOTE-ENN | **0.9998** | **0.9983** | **0.9971** | **0.9971** |
| ADASYN | 0.9972 | 0.9810 | 0.9768 | 0.9769 |

**Table 2:** Average ranking of each balancing method on the four credit-scoring models.

| Method | Logistic | SVM | BP-ANN | XGBoost | Average ranking |
|---|---|---|---|---|---|
| SMOTE-ENN | 1 | 1 | 1 | 1 | 1 |
| SMOTE-Tomek | 2 | 2 | 2 | 2 | 2 |
| Tomek Links | 4 | 3.75 | 8.5 | 6 | 5.5625 |
| One-Sided Selection | 4 | 4 | 9.25 | 5.25 | 5.625 |
| ADASYN | 5 | 4.75 | 11.75 | 6 | 6.875 |
| SMOTE | 8.75 | 9 | 4.5 | 5.75 | 7 |
| SVM-SMOTE | 7.75 | 11 | 6.75 | 3 | 7.125 |
| RandomUnderSampler | 9 | 8.75 | 3.5 | 8 | 7.3125 |
| Neighborhood Cleaning Rule | 5.75 | 5.75 | 9.75 | 10.25 | 7.875 |
| RandomOverSampler | 7.75 | 7.5 | 6.25 | 10.75 | 8.0625 |
| Near Miss | 11 | 8.5 | 4 | 12 | 8.875 |
| Borderline-SMOTE | 12 | 12 | 10.75 | 8 | 10.6875 |

To be specific, for each model we first rank the methods on every evaluation indicator in descending order of value, so that a sample-balancing method achieving the optimal value of a specific indicator receives a ranking of $1$ for that indicator. As an example, with the logistic model, Tomek Links yields an $\mathrm{AUC}$ of 0.7365, which is inferior to SMOTE-Tomek ($\mathrm{AUC}=0.7623$) and SMOTE-ENN ($\mathrm{AUC}=0.7725$) but better than all the other methods, so its $\mathrm{AUC}$ ranking is 3. Continuing in this way, the rankings of Tomek Links are $3$ ($\mathrm{AUC}$), $6$ ($\mathrm{Sensitivity}$), $3$ ($G$-mean) and $4$ ($F$-score) for the four individual indicators, so its average ranking on the logistic model is $4$ (the mean of $3$, $6$, $3$ and $4$). Based on this principle, we calculate the average ranking of each sample-balancing method on each credit-scoring model and finally obtain the total average rankings as the evaluation of average relative performance. Table 2 shows the ranking results; the last column gives the average ranking of each balancing method over the four credit-scoring models.
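This ranking procedure can be reproduced mechanically with `scipy.stats.rankdata`; the sketch below uses four of the logistic-model rows from Table 1.

```python
import numpy as np
from scipy.stats import rankdata

# Rows = balancing methods, columns = indicators
# (AUC, Sensitivity, G-mean, F-score) for one model; higher is better.
scores = np.array([
    [0.7365, 0.5101, 0.6719, 0.6279],   # Tomek Links
    [0.7623, 0.5512, 0.7027, 0.6659],   # SMOTE-Tomek
    [0.7725, 0.6207, 0.7226, 0.7074],   # SMOTE-ENN
    [0.7248, 0.5014, 0.6615, 0.6157],   # RandomUnderSampler
])

# Rank each indicator column in descending order (rank 1 = best value).
ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, scores)
avg_rank = ranks.mean(axis=1)   # average ranking per method on this model
print(avg_rank)
```

Averaging these per-model averages over the four models then yields the last column of Table 2.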

As can be seen from Table 2, there is no doubt that the SMOTE-ENN method, which achieved the optimal performance for all models and all evaluation indicators, ranks first, followed by the SMOTE-Tomek method. At the same time, the SMOTE-ENN and SMOTE-Tomek methods rank first and second, respectively, on each of the four models, which indicates that their performance is consistent. The relative ranking of the other methods on each model fluctuates to some extent. On average, the Tomek Links method ranks third, with an average ranking value of 5.5625; One-Sided Selection ranks fourth, with an average ranking value of 5.625; the ADASYN method and the SMOTE method rank fifth and sixth, respectively. Meanwhile, the overall performance of RandomOverSampler, Near Miss and borderline-SMOTE is relatively weak, as their average rankings are the poorest.

In conclusion, of all the 12 sample-balancing methods involved in the comparison, the SMOTE-ENN and SMOTE-Tomek methods achieved the best performance. This finding indicates the following.

- (1)
Compared with single-structure sample-balancing methods such as the Neighborhood Cleaning Rule, RandomOverSampler or Near Miss, composite-structure sample-balancing methods are more likely to support better performance of the credit-scoring model. Generally speaking, there is a complementary relationship between the composite operation modules, which helps the balanced data set retain more information that is valuable for classification.

- (2)
Of the composite-structure sample-balancing methods, those methods designed to first generate new minority samples and then eliminate redundant samples, such as SMOTE-ENN and SMOTE-Tomek, are better than the methods that eliminate redundant samples first and then regenerate new minority samples, such as borderline-SMOTE and SVM-SMOTE. One possible reason for this is that the newly generated minority samples may contain some noisy information. Therefore, if the data cleaning is carried out after the generation step, the final balanced data set will be more refined.

#### 3.4.2 Differences tests

Are the performance advantages of the SMOTE-ENN and SMOTE-Tomek methods over the other methods statistically significant? Which methods differ significantly from each other? These questions need to be answered by statistical tests. In this section, two nonparametric statistical tests are used to verify the significance of the performance differences between all the compared balancing methods.

First, the Friedman test is used to verify the significance of the performance differences between all the compared balancing methods by ranking the values of all the evaluation indicators for every balancing method applied to the four credit-scoring models. The null hypothesis of the Friedman test is that the 12 sample-balancing methods make the same contribution to all the credit-scoring models, in which case their performance differences would not be statistically significant. The test results are shown in Table 3.

**Table 3:** Friedman test result.

| Test | Statistical value | $p$-value | Hypothesis |
|---|---|---|---|
| Friedman | 99.6258 | 1.18e$-$16 | Rejected |

According to Table 3, there is a statistically significant difference in performance at the 99% confidence level between the 12 balancing methods across the four evaluation indicators, with a large statistical value (99.6258).
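The same test is available as `scipy.stats.friedmanchisquare`. The sketch below uses made-up indicator values for three hypothetical methods whose ordering is consistent across eight blocks, so the null hypothesis is clearly rejected:

```python
from scipy.stats import friedmanchisquare

# Made-up indicator values of three hypothetical balancing methods over
# eight model/indicator blocks; method_a always wins, method_c always loses.
method_a = [0.77, 0.74, 0.83, 0.995, 0.72, 0.73, 0.77, 0.71]
method_b = [0.76, 0.73, 0.81, 0.990, 0.70, 0.71, 0.75, 0.69]
method_c = [0.72, 0.69, 0.78, 0.980, 0.66, 0.68, 0.71, 0.65]

stat, p = friedmanchisquare(method_a, method_b, method_c)
print(round(stat, 4), p)  # a perfectly consistent ordering gives stat = 16
```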

However, while the Friedman test can confirm whether there are significant differences between multiple methods, it cannot indicate which methods have greater differences between them. Therefore, the Nemenyi test is applied and results are shown in Figure 2. If any methods are connected by a line segment, this means there is no statistically significant difference in their contribution to the performance of all the credit-scoring models.

It can be inferred from Figure 2 that the SMOTE-ENN method is statistically significantly better than the other 11 methods, since it is not connected to any other method by a line segment. In addition, the contribution of the SMOTE-Tomek method is also statistically distinct, as it is not connected to any other method. Two undersampling methods, Tomek Links and One-Sided Selection, are statistically similar in their contribution to the performance of the credit-scoring models, and they are significantly different from all the other methods.

The borderline-SMOTE method is statistically different from all the other methods, but, combined with the findings in Table 2, this may be due to its weak performance relative to the other methods. The remaining methods show no statistically significant differences in performance among themselves, except for the Near Miss and ADASYN methods, which are not directly connected by a line segment.
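Concretely, the Nemenyi test declares two methods significantly different when their average ranks differ by more than a critical difference. The sketch below computes it; the studentized-range constant for $k=12$ is an approximate table value, and $n=16$ (four models times four indicators) is an assumed block count.

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference for the Nemenyi post hoc test: two methods
    differ significantly if their average ranks differ by more than
    CD = q_alpha * sqrt(k (k + 1) / (6 n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# k = 12 balancing methods over n = 16 model/indicator blocks.
# q_alpha at the 5% level for k = 12 is roughly 3.27 (Demsar-style
# studentized-range table; treat this constant as an approximation).
cd = nemenyi_cd(3.27, k=12, n=16)
print(round(cd, 3))
```

Methods whose average rankings in Table 2 differ by less than this CD are joined by a line segment in Figure 2.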

Overall, these findings demonstrate that the SMOTE-ENN and SMOTE-Tomek methods have a statistically significant performance advantage over the other ten methods, while the Tomek Links and One-Sided Selection methods make statistically indistinguishable performance contributions to each other but also hold a statistically significant performance advantage over the remaining eight methods.

#### 3.4.3 Stability tests

According to the recent regulatory requirements for credit-scoring models (for example, the European Banking Authority (2017) guidelines on PD estimation), after slicing data into development samples, banks need to “ensure that the data used in risk quantification is representative of the application portfolio covered by the relevant model.… Institutions should analyse the comparability of the data used for the purpose of calculating long-run average default rates or long-run average LGDs”. In this section, we further check whether any of these balancing methods will create bias in that sense.

Since the balancing operation directly changes the proportion of the two sample classes in the training data set, and each credit-scoring model is trained on the balanced data set, any bias would show up as a significant difference between the class distribution of each model's predictions on the (unbalanced) testing data and the true label distribution of the initial unbalanced training data set. To check this, the population stability index (PSI) was used.

The PSI divides the two sequences into several subregions of equal length or frequency and judges whether the two sequences obey the same distribution by comparing the distribution proportion of samples in these regions. It is calculated as

$$\mathrm{PSI}=\sum_{i=1}^{n}(A_{i}-E_{i})\ln\left(\frac{A_{i}}{E_{i}}\right), \qquad (3.8)$$

where ${A}_{i}$ is the proportion of samples in subregion $i$ of the actual sequence (the prediction results from the testing period), ${E}_{i}$ is the proportion of samples in subregion $i$ of the expected sequence (the actual samples from the initial imbalanced data set), $\ln$ is the natural logarithm and $n$ is the number of subregions.

In general, the smaller the distribution difference between the two sequences $A$ and $E$, the smaller the PSI value. If the PSI is less than 0.1, it means there is no significant population change between the two sequences. If PSI is greater than 0.1 but less than 0.2, it means there is a moderate population change between the two sequences. If PSI is greater than 0.2, it means there is a significant population change between the two sequences. Table 4 shows the average, maximum and minimum PSI of each balancing method on different credit-scoring models during the fivefold cross-validation.
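A direct implementation of (3.8) on bin proportions is sketched below (a small floor guards against empty subregions); the normal samples are synthetic, purely to show the thresholds in action.

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population stability index between two score sequences,
    binned into equal-width subregions of the expected sequence."""
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Proportions per subregion; a small floor avoids log(0) / division by 0.
    e = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
base    = rng.normal(0.0, 1.0, 10_000)   # expected sequence
same    = rng.normal(0.0, 1.0, 10_000)   # same population -> PSI < 0.1
shifted = rng.normal(0.5, 1.0, 10_000)   # shifted population -> PSI > 0.1
print(psi(base, same), psi(base, shifted))
```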

**Table 4:** Average, maximum and minimum PSI of each balancing method on the four credit-scoring models during fivefold cross-validation.

(a) Logistic model

| Balancing method | Mean PSI | Max PSI | Min PSI |
|---|---|---|---|
| RandomUnderSampler | 0.0036 | 0.0067 | 0.0007 |
| One-Sided Selection | 0.0117 | 0.0215 | 0.0046 |
| Neighborhood Cleaning Rule | 0.0080 | 0.0120 | 0.0038 |
| Near Miss | 0.0127 | 0.0179 | 0.0059 |
| Tomek Links | 0.0096 | 0.0144 | 0.0041 |
| RandomOverSampler | 0.0037 | 0.0068 | 0.0003 |
| SMOTE | 0.0031 | 0.0049 | 0.0006 |
| Borderline-SMOTE | 0.0006 | 0.0015 | 0.0000 |
| SVM-SMOTE | 0.0014 | 0.0024 | 0.0001 |
| SMOTE-Tomek | 0.0104 | 0.0151 | 0.0045 |
| SMOTE-ENN | 0.0071 | 0.0114 | 0.0022 |
| ADASYN | 0.0268 | 0.0353 | 0.0161 |

(b) SVM

| Balancing method | Mean PSI | Max PSI | Min PSI |
|---|---|---|---|
| RandomUnderSampler | 0.0839 | 0.1188 | 0.0318 |
| One-Sided Selection | 0.0961 | 0.1210 | 0.0808 |
| Neighborhood Cleaning Rule | 0.1047 | 0.1364 | 0.0381 |
| Near Miss | 0.0923 | 0.1256 | 0.0710 |
| Tomek Links | 0.0947 | 0.1135 | 0.0670 |
| RandomOverSampler | 0.0946 | 0.1156 | 0.0710 |
| SMOTE | 0.0953 | 0.1188 | 0.0702 |
| Borderline-SMOTE | 0.0752 | 0.1098 | 0.0503 |
| SVM-SMOTE | 0.0776 | 0.1060 | 0.0225 |
| SMOTE-Tomek | 0.1004 | 0.1300 | 0.0615 |
| SMOTE-ENN | 0.1329 | 0.1769 | 0.1047 |
| ADASYN | 0.1003 | 0.1243 | 0.0838 |

(c) BP-ANN

| Balancing method | Mean PSI | Max PSI | Min PSI |
|---|---|---|---|
| RandomUnderSampler | 0.0546 | 0.0895 | 0.0114 |
| One-Sided Selection | 0.1432 | 0.2299 | 0.1063 |
| Neighborhood Cleaning Rule | 0.1047 | 0.2230 | 0.0333 |
| Near Miss | 0.0450 | 0.0886 | 0.0100 |
| Tomek Links | 0.0911 | 0.1966 | 0.0420 |
| RandomOverSampler | 0.0578 | 0.1278 | 0.0053 |
| SMOTE | 0.0657 | 0.1181 | 0.0095 |
| Borderline-SMOTE | 0.0638 | 0.0966 | 0.0283 |
| SVM-SMOTE | 0.0587 | 0.1133 | 0.0167 |
| SMOTE-Tomek | 0.0937 | 0.1966 | 0.0039 |
| SMOTE-ENN | 0.0570 | 0.1297 | 0.0209 |
| ADASYN | 0.0430 | 0.0676 | 0.0074 |

(d) XGBoost

| Balancing method | Mean PSI | Max PSI | Min PSI |
|---|---|---|---|
| RandomUnderSampler | 0.0005 | 0.0009 | 0.0002 |
| One-Sided Selection | 0.0001 | 0.0001 | 0.0000 |
| Neighborhood Cleaning Rule | 0.0000 | 0.0000 | 0.0000 |
| Near Miss | 0.0010 | 0.0021 | 0.0003 |
| Tomek Links | 0.0001 | 0.0002 | 0.0001 |
| RandomOverSampler | 0.0000 | 0.0001 | 0.0000 |
| SMOTE | 0.0000 | 0.0001 | 0.0000 |
| Borderline-SMOTE | 0.0001 | 0.0002 | 0.0000 |
| SVM-SMOTE | 0.0000 | 0.0000 | 0.0000 |
| SMOTE-Tomek | 0.0001 | 0.0001 | 0.0000 |
| SMOTE-ENN | 0.0062 | 0.0076 | 0.0051 |
| ADASYN | 0.0002 | 0.0009 | 0.0000 |

Table 4 yields the following.

- (1)
When the logistic regression model, the most popular in the industry, is applied, the average, maximum and minimum PSI of each balancing method are much less than $0.1$, indicating that for all of the compared balancing methods the distribution of the prediction results is consistent with the distribution of the original, imbalanced training data set. Therefore, in credit-scoring practice based on the logistic regression model, the various balancing methods can maintain the consistency of sample characteristics and prediction results at different stages, and can meet the relevant requirements of the regulatory authorities while keeping the model results easily interpretable.

- (2)
When using the SVM or BP-ANN models, the stability of each balancing method is reduced. Specifically, with the SVM model, the mean PSI of the Neighborhood Cleaning Rule, SMOTE-Tomek, SMOTE-ENN and ADASYN is within $[0.1,0.2]$, and the maximum PSI of all balancing methods is within $[0.1,0.2]$. For the BP-ANN model, the mean PSI of One-Sided Selection and the Neighborhood Cleaning Rule is in $[0.1,0.2]$, while their maximum PSI exceeds 0.2, and the maximum PSI of Tomek Links, RandomOverSampler, SMOTE, SVM-SMOTE, SMOTE-Tomek and SMOTE-ENN is also within the range $[0.1,0.2]$. In general, for these two credit-scoring models, the average PSI values of all balancing methods show that they can maintain moderate stability, while RandomUnderSampler, Near Miss and the other methods with a relatively weak performance contribution can maintain more consistent stability.

- (3)
When the XGBoost model is used, the average, maximum and minimum PSI of each balancing method are significantly less than those for other credit-scoring models, and the difference in PSI value between the different methods becomes smaller. This finding shows that the combination of an appropriate credit-scoring model and balancing method can best maintain the stability of credit-scoring results.

Generally speaking, on average, most of the balancing methods enable the credit-scoring model to maintain good stability, and the remaining balancing methods still allow it to maintain a moderate level of stability. The better-performing SMOTE-ENN and SMOTE-Tomek methods, as shown earlier, help the BP-ANN and XGBoost models maintain good stability and provide moderate stability when combined with the logistic regression and SVM models.

#### 3.4.4 Robustness tests

Finally, the robustness of each balancing method is tested further. The test is divided into two parts: the performance fluctuation of different balancing methods under the setting of varying balancing ratios, and the performance fluctuation after balancing based on sample sets with different dimensions.

##### Different balancing ratios.

First, we verify the performance fluctuation of each balancing method on all credit-scoring models when different target balancing ratios are set. As four of the balancing methods (One-Sided Selection, Neighborhood Cleaning Rule, Condensed Nearest Neighbor and Tomek Links) do not allow the user to specify the target balancing ratio, these methods were not included in this comparison. For the remaining balancing methods, we sweep the target balancing ratio over the interval $[0.5,1]$ with a step size of 0.05, balancing the training data set at each ratio to fully measure the performance of each method.
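In imblearn this target ratio is the samplers' `sampling_strategy` argument (the minority-to-majority ratio after resampling). To make the sweep explicit, the pure-NumPy sketch below mimics random oversampling to a given ratio on synthetic labels:

```python
import numpy as np

def random_oversample(X, y, ratio, seed=0):
    """Randomly duplicate minority samples until
    n_minority / n_majority == ratio (imblearn's `sampling_strategy`
    plays the same role for its samplers)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_target = int(round(ratio * len(majority)))
    extra = rng.choice(minority, size=n_target - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

X = np.arange(1000).reshape(-1, 1).astype(float)
y = np.array([0] * 900 + [1] * 100)          # 9:1 imbalance

for ratio in np.arange(0.5, 1.01, 0.05):     # the paper's grid: step 0.05
    Xb, yb = random_oversample(X, y, float(ratio))
    # each balanced set would then be used to retrain the scoring model
    assert abs((yb == 1).sum() / (yb == 0).sum() - ratio) < 0.01
```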

Figures 3–6 show, respectively, the performance fluctuation in the testing period of the four credit-scoring models trained on the data sets balanced by each method.

It can be seen from Figure 3 that in the logistic model, in the process of the balancing ratio gradually increasing from 0.5 to 1, the application of ADASYN, SMOTE-ENN, borderline-SMOTE and other methods makes the $\mathrm{AUC}$, $\mathrm{Sensitivity}$, $G$-mean and other evaluation indicators lower, while the other balancing methods can increase these evaluation indicators. At the same time, except for the SMOTE-ENN method, the $F$-score of most sample-balancing methods shows an obvious upward trend as the balancing ratio increases. In addition, the fluctuation range of most methods is no more than 3%, and the fluctuation range of $\mathrm{Sensitivity}$ increases with increasing balancing ratio, while the other three evaluation indicators show stable trends.

In the SVM model (Figure 4), with increasing balancing ratio, the $\mathrm{AUC}$s of RandomOverSampler, ADASYN and borderline-SMOTE show a slow downward trend, the $\mathrm{AUC}$ of SMOTE-ENN shows an initial downward trend and then an upward trend, and the $\mathrm{AUC}$s of the other methods show a slow upward trend. For the other three evaluation indicators, the SVM models with each balancing method show a more obvious trend of performance improvement as the data sets become more balanced. In terms of the range of performance fluctuation, the fluctuations of $\mathrm{Sensitivity}$ and $F$-score are the largest, at about 5%, the fluctuation in the $\mathrm{AUC}$ indicator is generally within 2% and the fluctuation range of the $G$-mean is the smallest (at a level of 0.5%).

In the BP-ANN model (Figure 5), the fluctuation of each evaluation indicator with a change in the balancing ratio is obviously intensified. From a trend perspective, the values of the evaluation indicators of the SVM-SMOTE method increase with the balancing ratio, while the Near Miss method shows an obvious downward trend for the four evaluation indicators. Most of the other methods slowly improve the evaluation results in the fluctuation. In general, the performance fluctuations of the SMOTE-ENN, RandomUnderSampler and Near Miss methods are larger than those of the other compared methods.

It can be seen from Figure 6 that in the XGBoost model each balancing method has a limited impact on the fluctuation in each evaluation indicator. As the balancing ratio continues to increase, the fluctuation of each evaluation indicator will never exceed 1%. For most balancing methods, increasing the balancing ratio can slightly improve the value of each evaluation indicator and maintain reasonable stability. Under different balancing ratios, the performance of the SMOTE-ENN method remains more stable than that of the other methods, while the evaluation results of the RandomUnderSampler method and the Near Miss method fluctuate a little.

It can be concluded from the above results that the setting of the balancing ratio has some impact on the performance of the various balancing methods on all the credit-scoring models. Generally speaking, when the balancing methods are applied, the fluctuation in the SVM model's performance is greater than that of the other three models, while the fluctuation in the XGBoost model's performance is the smallest. From the perspective of the sample-balancing methods, the fluctuation in model performance caused by the SMOTE-ENN, RandomUnderSampler and Near Miss methods under different balancing ratio settings is wider than that of the other methods. Therefore, when using these sample-balancing methods, the balancing ratio needs to be selected carefully.

##### Different sample dimensions.

Next, we verify the performance fluctuation of each balancing method for all credit-scoring models when different numbers of independent variables are used for modeling. In the process of constructing the real-world credit-scoring model, the problem of missing variables and insufficient information is often encountered.

The data set used in the study has a total of 23 independent variables. We aim to construct credit-scoring models using between 3 and 22 independent variables to observe the effect of the different balancing methods on each model. Since the information value of the variables in the original data set differs, and to avoid the unfairness that randomly selecting variables for modeling might cause, we first use a relief algorithm to calculate the explanatory power of each variable for the sample label. Taking the data subset composed of the three variables with the strongest explanatory power as the starting point, we then repeatedly add the variable with the strongest explanatory power from among the remaining variables not yet participating in the modeling, continuously expanding and balancing the new modeling data set and recording the model performance during the expansion. Figure 7 shows the evaluation results for the importance of each variable obtained by the relief algorithm.
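The expansion procedure can be sketched as follows. Since a relief implementation is not part of sklearn, mutual information stands in here for the relief scores, purely to illustrate the ranking-then-nesting step on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the 23-variable credit card data set.
X, y = make_classification(n_samples=500, n_features=23, n_informative=8,
                           random_state=0)

# The paper scores variables with a relief algorithm; mutual information
# is used here only as a readily available substitute for that scoring step.
importance = mutual_info_classif(X, y, random_state=0)
order = np.argsort(importance)[::-1]        # strongest explanatory power first

# Nested subsets: start from the top 3 variables, then add one at a time.
subsets = [order[:k] for k in range(3, 23)]
print(len(subsets), len(subsets[0]), len(subsets[-1]))
```

Each subset would then be balanced and used to retrain the four scoring models, producing the curves in Figures 8–11.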

It can be seen from Figure 7 that, except for the EDUCATION variable, all variables have a positive impact on the sample risk. The variables PAY_0, PAY_2 and PAY_3 have the greatest influence, while PAY_AMT2, MARRIAGE and PAY_AMT3 have the least.

Figures 8–11 show the performance fluctuation in the testing period of the four credit-scoring models as trained on the data set with different numbers of variables balanced by each method.

It can be seen from Figure 8 that, in the logistic model, with an increasing number of variables in the data set, the $\mathrm{Sensitivity}$, $G$-mean and $F$-score all show a trend of decreasing first and then increasing, and the $\mathrm{AUC}$ shows an overall upward trend. The performance of the SMOTE-ENN and SMOTE-Tomek methods is generally better than that of the other methods in most cases. When there are more than nine variables, the performance fluctuation in the logistic model for each balancing method reduces and their trend becomes more stable.

Figure 9 shows that, in the SVM model, when fewer than nine variables are used for modeling, the model performance fluctuates significantly. After that, as the number of variables increases, most performance indicators tend to stabilize. For most sample-balancing methods, using more variables decreases the performance of the model, but it also reduces the fluctuation in performance. For the $\mathrm{AUC}$ indicator, using more variables is beneficial in increasing the overall prediction accuracy of the model when the SMOTE-ENN or SMOTE-Tomek method is used. Meanwhile, for the $\mathrm{Sensitivity}$ indicator, using an increasing number of variables with the SMOTE-Tomek method continues to improve the recognition ability for minority samples, while the SMOTE-ENN method yields the best $\mathrm{Sensitivity}$ value for the SVM model when modeling with six variables. For the $G$-mean indicator, the evaluation results show an obvious upward trend when the SMOTE-ENN or SMOTE-Tomek method is used, while the remaining balancing methods perform best when between five and seven variables are used. For the $F$-score indicator, the optimal performance of each balancing method is usually achieved when six to eight variables are selected.

Further, it can be seen from Figure 10 that, in the BP-ANN model, all the performance indicators show a stable upward trend with an increasing number of variables, and the marginal upward range is relatively wide when the number of variables is small, while the marginal upward range is relatively small when the number of variables is large. In general, increasing the number of variables used in the BP-ANN model when using sample-balancing methods is conducive to improving the overall performance of the model.

Finally, from Figure 11 we can see that, in the XGBoost model, with an increasing number of variables all 12 sample-balancing methods yield obvious upward trends in the four performance indicators. Moreover, the fastest rise occurs as the number of variables increases from 7 to 11, followed by the increase from 3 to 7; when the number of variables is greater than 12, adding new variables has a limited effect on performance improvement. At the same time, compared with the other 11 balancing methods, the SMOTE-ENN method helps the XGBoost model perform better even when the number of variables used is small.

It can be concluded from the above findings that the number of variables included in the data set for sample balancing is a key factor affecting the performance of each credit-scoring model. In general, the larger the number of related variables included in the balanced data set, the greater the improvement in the comprehensive performance of each credit-scoring model trained on it. In addition, although in some cases a model trained on a data set with a small number of related variables can obtain the best performance on specific evaluation indicators (such as the logistic regression model trained on the data set balanced by the SMOTE-ENN method with only the six most important variables), it is often difficult for users to determine the appropriate number of variables in the absence of prior information. Overall, the SMOTE-ENN method is optimal under the various settings. Moreover, the performance of a model built on a data set containing fewer variables is more sensitive to the selection of variables than that of a model built on a data set containing more variables, so a credit-scoring model constructed with more related variables will be more robust.

## 4 Conclusions and suggestions

In this paper, we verified the performance of 12 representative sample-balancing methods using four popular credit-scoring models. Through this research we can draw the following conclusions.

- •
In terms of performance, the SMOTE-ENN method, which achieved the optimal level in all models and for all evaluation indicators, ranks first, followed by SMOTE-Tomek, Tomek Links and One-Sided Selection, while the performance of RandomOverSampler, Near Miss and borderline-SMOTE is relatively weak, as their average ranking is lowest.

- •
In terms of differences, the SMOTE-ENN and the SMOTE-Tomek methods have a statistically significant performance advantage compared with the other 10 methods, while the Tomek Links and the One-Sided Selection methods are statistically similar in their contribution to the performance of credit-scoring models, as well as being significantly different from all the other methods. Although the borderline-SMOTE method shows the weakest performance, it is statistically different from all the remaining methods, which are statistically similar to each other.

- •
In terms of stability, most of the balancing methods can make the credit-scoring model stable, and the remaining methods can also permit the credit-scoring model to maintain a moderate level of stability. In general, all the balancing methods can maintain stability and thus meet the relevant requirements of the regulatory authorities.

- In terms of robustness, the performance of each credit-scoring model will also be influenced by the balancing ratio and the number of relevant variables in the data set. We find the fluctuation in model performance caused by SMOTE-ENN, RandomUnderSampler and Near Miss under different balancing ratio settings is wider than that of the other balancing methods. In addition, in most cases, the greater the number of relevant variables included in the balanced data set, the more beneficial the data set is for the comprehensive performance and robustness of each credit-scoring model trained on it.

Based on the above research findings, we make the following suggestions for credit institutions such as commercial banks using the sample-balancing method in their credit-scoring process.

##### Method selection in the absence of prior information.

Our research finds that the SMOTE-ENN method yields significantly better performance than the other balancing methods across the various evaluation indicators and credit-scoring models, such as logistic regression, SVM, BP-ANN and XGBoost. Therefore, we suggest combining these models with the SMOTE-ENN method. At the same time, since the performance of each sample-balancing method is affected by the features of the data set and of the credit-scoring model, we suggest that users conduct a sufficiently broad performance comparison across different balancing methods when selecting the optimal one. Although this takes more calculation time, the results will be more reliable. In addition, although the RandomUnderSampler and RandomOverSampler methods are simple to use and popular, our empirical results show that these two methods do not contribute as much to the performance of credit-scoring models as the other methods.

##### More robust methods.

Users can also simultaneously apply SMOTE-Tomek, Tomek Links, One-Sided Selection and other methods with relatively strong performance to generate multiple training subsets with different balance characteristics and to build multiple credit-scoring submodels. Combining these submodels in the spirit of ensemble modeling can further improve the stability of the scoring results.

##### Other functions of the sample-balancing method.

In addition to alleviating the imbalance in credit-modeling data sets, sample-balancing methods can provide other assistance. For example, in bank credit-scoring practice, models are typically trained on the majority of observations in the middle of the rating scale, but they do not work well in the “corners” owing to the lack of samples. With the various balancing methods that generate new samples, users can create more samples for specific customer groups and thus carry out refined, targeted credit scoring based on effectively scaled data. Further, as the credit business develops, the customer group structure and market environment may change, and there will always be some discrepancy between actual and expected test results. We therefore suggest that, when constructing the training data set, the most appropriate balancing method be used to generate more minority samples based on recent new loan samples, or that more of the early loan samples be deleted, so as to better reflect the recent characteristics of the customer group and to reduce errors.

## Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

## Acknowledgements

The work was supported in part by the Natural Science Foundation of Jiangsu Higher Education Institutions of China under grant 22KJB630002, the Jiangsu University Philosophy and Social Sciences Research Project under grant 2021SJA0113, the Natural Science Foundation of Nanjing University of Posts and Telecommunications under grant NY221108 and the Nanjing University of Posts and Telecommunications Scientific Foundation under grant NYY221026.

