Predicting Credit Scores with Boosted Decision Trees

João A. Bastos

Lisbon School of Economics and Management (ISEG) and CEMAPRE/REM, Universidade de Lisboa, 1200-781 Lisboa, Portugal

Forecasting 2022, 4(4), 925-935; https://doi.org/10.3390/forecast4040050

Submission received: 14 October 2022 / Revised: 6 November 2022 / Accepted: 16 November 2022 / Published: 17 November 2022

(This article belongs to the Section Forecasting in Economics and Management)

Abstract

Credit scoring models help lenders decide whether to grant or reject credit to applicants. This paper proposes a credit scoring model based on boosted decision trees, a powerful learning technique that aggregates several decision trees to form a classifier given by a weighted majority vote of classifications predicted by individual decision trees. The performance of boosted decision trees is evaluated using two publicly available credit card application datasets. The prediction accuracy of boosted decision trees is benchmarked against two alternative machine learning techniques: the multilayer perceptron and support vector machines. The results show that boosted decision trees are a competitive technique for implementing credit scoring models.

1. Introduction

The accurate assessment of consumer credit risk is of utmost importance for lending organizations. Credit scoring is a widely used technique that helps financial institutions evaluate the likelihood that a credit applicant will default on a financial obligation, and decide whether or not to grant credit. The precise judgment of the creditworthiness of applicants allows financial institutions to increase the volume of granted credit while minimizing potential losses. The credit industry has experienced tremendous growth in the past few decades [1]. The increased number of potential applicants has driven the development of sophisticated techniques that automate the credit approval procedure and monitor the financial health of the borrower. The large volume of loan portfolios also implies that modest improvements in scoring accuracy may result in significant savings for financial institutions [2].

The goal of a credit scoring model is to classify credit applicants into two classes: the “good credit” class that will likely reimburse the financial obligation, and the “bad credit” class that should be denied credit due to the high probability of defaulting on the financial obligation. The classification is contingent on the sociodemographic characteristics of the borrower (such as age, education level, occupation, and income), the repayment history on previous loans, and the type of loan. These models are also applicable to small businesses, since these may be regarded as extensions of an individual customer. In the last few decades, various quantitative methods have been proposed in the literature to evaluate consumer loans and improve credit scoring accuracy (for a review, see, e.g., Crook et al. [1]). These models can be grouped into statistical or machine learning models. The most popular statistical models are linear discriminant analysis and logistic regression. Linear discriminant analysis was the first parametric technique suggested for credit scoring purposes [3]. This approach has attracted criticism due to the categorical nature of the data and the fact that the covariance matrices of the good credit and bad credit groups are typically distinct. Logistic regression overcomes these deficiencies and became a common credit scoring tool among practitioners in financial institutions [4]. Machine learning techniques applied to credit scoring include k-nearest neighbors [5], decision trees [6,7], artificial neural networks [2,8,9,10], genetic programming [11], and support vector machines [12,13,14,15,16]. Research on hybrid data mining approaches has also shown promising results [17,18,19].

While the pursuit of better classifiers for credit scoring applications is an important research effort, improved accuracy can be easily achieved by aggregating the predictions given by an ensemble of individual models; see, e.g., [20]. West et al. [21] found that the accuracy of an ensemble of neural networks is superior to that of a single neural network in credit scoring and bankruptcy prediction applications. This paper proposes a credit scoring model of consumer loans based on boosted decision trees, a powerful learning technique in which an ensemble of decision trees is developed to form a classifier given by a weighted majority vote of the classifications predicted by the individual trees. The decision trees are grown sequentially using reweighted training sets. If an instance is misclassified by a tree, its weight is increased. Consequently, the predominance of “hard-to-classify” instances in the training sample increases with the number of grown trees. The performance of boosted decision trees was evaluated using two real-world credit datasets from the UC Irvine Machine Learning Repository [22] and compared to that of a multilayer perceptron and a support vector machine (this paper is an extended version of an early draft titled “Credit scoring with boosted decision trees” [23]; many important papers on credit scoring with boosted decision trees have been published in the meantime (e.g., [24,25,26,27,28])).

The rest of this paper is organized as follows. In the next section, boosted decision trees are introduced. This is followed by a description of the datasets and a comparison of the predictive accuracy of the models. A discussion of the relative contribution of the attributes to the separation of the good credit and bad credit classes is also given. Section 3 concludes the paper.

1.1. Decision Trees

Suppose one has a database of several credit applicants described by $n$ attributes or characteristics: $x_1, x_2, \dots, x_n$. These applicants belong to two classes, denoted “good credits” and “bad credits”. The goal of a credit scoring model is to find a classifier that separates the good credit sample from the bad credit sample. A decision tree consists of a set of sequential binary splits of the data. The algorithm begins with a root node containing all credit applicants. Then, the algorithm loops over all possible binary splits in order to find the attribute $x$ and corresponding cut-off value $c$ that give the best separation into one side having mostly good credits and the other mostly bad credits. For example, in Figure 1 this is achieved when the data in the root node are split between instances with attribute $x_i \geq c_i$ and those with $x_i < c_i$. This procedure is then repeated for the new daughter nodes until a stopping criterion is satisfied. Defining the purity $p$ of a node as the fraction of good credit instances in it, the splitting attribute and cut-off value are those that minimize the sum of the Gini indices $p(1-p)$ of the created daughter nodes. If, for every attribute and cut-off value, the sum of the Gini indices of the daughter nodes is higher than the Gini index of the parent node, the parent node is not split. Since the Gini index is a measure of the statistical dispersion or diversity of the population in a node, minimizing it results in daughter nodes that are more homogeneous than the parent node.
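For illustration, the split search just described can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the paper; the function names, the class coding (1 for good credits, 0 for bad credits), and the exhaustive scan over the unique attribute values are assumptions made for clarity.

```python
import numpy as np

def gini(labels):
    """Gini index p(1 - p), where p is the fraction of good credits in the node."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels == 1)              # labels: 1 = good credit, 0 = bad credit
    return p * (1.0 - p)

def best_split(X, y):
    """Scan all attributes and cut-off values; return the split that minimizes the
    sum of the Gini indices of the two daughter nodes."""
    best = (None, None, gini(y))          # (attribute index, cut-off, score to beat)
    for j in range(X.shape[1]):           # loop over attributes
        for c in np.unique(X[:, j]):      # loop over candidate cut-off values
            left, right = y[X[:, j] < c], y[X[:, j] >= c]
            score = gini(left) + gini(right)
            if score < best[2]:
                best = (j, c, score)
    return best                           # (None, None, ...) means the node is left unsplit
```

If no candidate split lowers the parent's Gini index, the function returns no attribute, which corresponds to leaving the parent node unsplit as described above.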

Unsplit nodes are called “leaves” and are depicted by rectangles in Figure 1. The leaves are classified according to the most prevalent class in them. A leaf is called a “good credit leaf” if it contains more good credit applicants than bad credit applicants; otherwise, it is called a “bad credit leaf”. A good (bad) credit is correctly classified if it lands on a good (bad) credit leaf. Very frequently, the resulting trees are quite large. Note that, in principle, a decision tree could be grown until all leaves contain only good credit instances or only bad credit instances. However, such a tree would overfit the training data. In these circumstances, the generalization performance may be improved if the tree is “pruned”. Pruning consists of cutting back the tree in order to remove redundant nodes [29].

Decision trees have been available since the 1980s and have been applied to the implementation of credit scoring models [6,7]. They are powerful and flexible classifiers. However, a well-known limitation of decision trees is their instability, since small fluctuations in the training data may result in large variations in the classifications assigned to the observations. For example, if there are two attributes with similar discrimination power, a small fluctuation in one of these attributes may cause the algorithm to split a given node using the other attribute, while the former would have been selected without the fluctuation. Since the whole tree structure is modified below this node, the fluctuation may produce a very different classifier. This difficulty is overcome by growing an ensemble of decision trees and classifying the instances by the majority vote of the classifications given by the individual trees.

1.2. Boosting

Boosting [30,31] is a procedure that aggregates many classifiers in order to achieve a high classification performance. Additionally, boosting helps stabilize the response of classifiers with respect to changes in the training sample. The boosting algorithm starts by assigning all credit applicants the same weight $w^{(0)}$. After a classifier is built, the weight of each applicant is changed according to the classification given by that classifier. Then, a second classifier is built using a re-weighted training sample. This procedure is typically repeated several hundred times. The final classification of a credit applicant is a weighted average of the individual classifications over all classifiers. There are several methods to update the weights and combine the individual classifiers. A popular boosting algorithm is AdaBoost [32], which was adopted in this study. After the $k$th decision tree is built, the total misclassification error $\varepsilon_k$ of the tree, defined as the sum of the weights of misclassified credits over the sum of the weights of all credits, is calculated:

$$\varepsilon_k = \frac{\sum_{i \in \mathrm{mis}} w_i^{(k)}}{\sum_i w_i^{(k)}}, \tag{1}$$

where $i$ loops over all instances in the data sample. Then, the weights of misclassified credit applicants are increased (boosted):

$$w_i^{(k+1)} = \frac{1-\varepsilon_k}{\varepsilon_k}\, w_i^{(k)}. \tag{2}$$

Finally, the new weights are renormalized, $w_i^{(k+1)} \to w_i^{(k+1)} / \sum_i w_i^{(k+1)}$, and the tree $k+1$ is constructed. Note that, as the algorithm progresses, the predominance of hard-to-classify instances in the training set is increased. The final classification or “score” of credit applicant $i$ is a weighted sum of the classifications over the individual trees:

$$F_i = \sum_{k=1}^{K} \log\!\left(\frac{1-\varepsilon_k}{\varepsilon_k}\right) f_i^{(k)}, \tag{3}$$

where $f_i^{(k)} = +1$ ($-1$) if the $k$th tree makes the instance land on a good (bad) credit leaf, and $K$ is the number of grown trees. Therefore, good credits will tend to have large positive scores, while bad credits will tend to have large negative scores. Furthermore, trees with lower misclassification errors $\varepsilon_k$ are given more weight when the final classification is computed.
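The weight-update and scoring rules in Equations (1)-(3) can be sketched as follows. This is a minimal Python illustration built on scikit-learn decision trees rather than the TMVA implementation used in the paper; the class coding (+1 for good credits, −1 for bad credits), the default number of trees, and the minimum node size are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bdt(X, y, K=100, min_split=20):
    """Boosting as described above; y must be coded +1 (good credit) / -1 (bad credit).
    Returns the grown trees and their voting weights log((1 - eps_k) / eps_k)."""
    w = np.full(len(y), 1.0 / len(y))            # equal initial weights w^(0)
    trees, alphas = [], []
    for _ in range(K):
        tree = DecisionTreeClassifier(min_samples_split=min_split)
        tree.fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        eps = w[miss].sum() / w.sum()            # Equation (1): weighted misclassification error
        if eps <= 0.0 or eps >= 0.5:             # stop if the tree is perfect or no better than chance
            break
        w[miss] *= (1.0 - eps) / eps             # Equation (2): boost misclassified weights
        w /= w.sum()                             # renormalize the weights
        trees.append(tree)
        alphas.append(np.log((1.0 - eps) / eps))
    return trees, alphas

def bdt_score(trees, alphas, X):
    """Equation (3): weighted sum of the individual tree votes f_i^(k) = +1 / -1."""
    return sum(a * t.predict(X) for a, t in zip(alphas, trees))
```

The early stop when $\varepsilon_k$ reaches 0 or 1/2 simply keeps the voting weights positive and finite; it is a common implementation choice, not a rule stated in the paper.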

2. Empirical Analysis

2.1. Data Sample

In this study, the credit scoring models were developed using two popular credit card application datasets from the UC Irvine Machine Learning Repository [22]. The German credit dataset consists of 1000 instances, of which 700 correspond to creditworthy applicants and 300 correspond to applicants to whom credit should not be extended. Each applicant is characterized by 24 attributes describing the status of existing accounts, credit history records, loan amount and purpose, employment status, and an assortment of personal information such as age, sex, and marital status. Three attributes are continuous and the remaining are categorical. A detailed description of these attributes can be found in Appendix A.

The Australian credit dataset contains 690 instances, of which 307 correspond to creditworthy applicants and 383 correspond to applicants to whom credit should be refused. Each instance is described by 14 attributes. Six attributes are continuous while the remaining are categorical. In order to preserve the confidentiality of the data, the names and values of the attributes were replaced by meaningless identifiers. A few instances had attributes with missing values; these were replaced by the mode and mean of the attribute for categorical and continuous variables, respectively. Note that, because only the best discriminating variable is selected in the node-splitting procedure, boosted decision trees are insensitive to the inclusion of attributes with weak discriminating power, while the training time scales only linearly with the dimensionality of the input patterns.
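The imputation step just described (mode for categorical attributes, mean for continuous ones) could be written, for example, as the following pandas sketch; the function name is illustrative and the attribute types are inferred from the data, since the Australian attributes are anonymized.

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values by the mode (categorical) or the mean (continuous)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:                       # categorical attribute
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:                                              # continuous attribute
            out[col] = out[col].fillna(out[col].mean())
    return out
```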

2.2. Performance Tuning

In a pattern classification problem, the data sample is usually divided into a training set and an independent (out-of-sample) test set. The classifier is fitted to the training set, and its predictive power is estimated on the test set. In order to train the classifiers with a large fraction of the available data and evaluate the generalization accuracy on the complete dataset, a 10-fold cross-validation was implemented. This technique consists of randomly dividing the dataset into ten mutually exclusive subsets of equal size and, sequentially, testing each of these subsets using the classifier trained on the remaining subsets.
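A minimal sketch of this cross-validation scheme is given below, assuming scikit-learn and a model that exposes a decision_function score (as boosted trees and support vector machines do in that library); every instance ends up with an out-of-sample score produced by a classifier trained on the other nine folds.

```python
import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_scores(model, X, y, n_splits=10, seed=0):
    """Score every instance with a classifier trained on the remaining folds."""
    scores = np.empty(len(y), dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.decision_function(X[test_idx])
    return scores
```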

There is no formal theory specifying how to select the optimal hyper-parameters for a given classifier. In practice, the selection of the best set of hyper-parameters is accomplished either by heuristic rules or by “grid-search”. In this approach, different sets of hyper-parameter values are scanned and the set with the best predictive performance is selected. The performance of boosted decision trees (BDT) is optimized by adjusting two hyper-parameters: the number of decision trees that are aggregated to form the final classifier and the minimum number of credit applicants that a tree node must contain in order to be split. When the number of applicants in a node falls below this threshold, the growth of the branch is terminated.
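As an illustration, a grid-search over these two BDT hyper-parameters might look as follows in scikit-learn (version 1.2 or later, where the base learner is passed as estimator); the grid values are hypothetical, and the paper itself uses the TMVA framework, so this is only a sketch of the procedure.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Grid over the number of aggregated trees and the minimum node size required for a split.
bdt_search = GridSearchCV(
    estimator=AdaBoostClassifier(estimator=DecisionTreeClassifier()),
    param_grid={
        "n_estimators": [100, 200, 400, 800],
        "estimator__min_samples_split": [10, 20, 50, 100],
    },
    scoring="roc_auc",
    cv=10,
)
# bdt_search.fit(X, y) would then select the configuration with the best cross-validated AUC.
```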

The multilayer perceptron (MLP) contained a single hidden layer (a network with a single hidden layer is sufficient to model a complex system to any desired degree of accuracy, provided sufficient hidden nodes are available [33]). The input layer contained a number of nodes equal to the number of attributes in the samples (24 nodes for the German dataset and 14 nodes for the Australian dataset), while the output layer contained a single node. The activation functions of the nodes were sigmoids. The network was trained by error back-propagation using the steepest descent algorithm. Three parameters were optimized: the number of neurons in the hidden layer, the number of epochs, and the learning rate. The support vector machine (SVM) was implemented with a Gaussian radial basis function kernel. Two parameters were optimized: the width of the Gaussian kernel $\sigma$ and the cost parameter $C$. To find the best pair $(\sigma, C)$, a grid-search was performed using the recipe in [34], in which these parameters take values from sequences of powers of 2. All models were implemented using the framework provided by the TMVA package [35].
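A hedged scikit-learn sketch of the two benchmark models is given below; the hidden-layer sizes, epoch counts, learning rates, and the particular powers-of-2 ranges are assumptions made for illustration, and scikit-learn parameterizes the RBF kernel by $\gamma = 1/(2\sigma^2)$ rather than by $\sigma$ directly.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Single sigmoid hidden layer trained by stochastic gradient descent; the grid scans
# hidden-layer size, number of epochs (max_iter), and learning rate.
mlp_search = GridSearchCV(
    MLPClassifier(activation="logistic", solver="sgd"),
    param_grid={
        "hidden_layer_sizes": [(5,), (10,), (20,)],
        "max_iter": [200, 500, 1000],
        "learning_rate_init": [0.001, 0.01, 0.1],
    },
    scoring="roc_auc",
    cv=10,
)

# RBF-kernel SVM with (gamma, C) scanned over powers of 2 (illustrative ranges).
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "gamma": 2.0 ** np.arange(-15, 4, 2),
        "C": 2.0 ** np.arange(-5, 16, 2),
    },
    scoring="roc_auc",
    cv=10,
)
```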

2.3. Results

The performance of credit scoring models is measured in terms of their capability to distinguish the good credit population from the bad credit population in the test sample. As mentioned in Section 1.2, the BDT algorithm assigns a score to each credit applicant according to Equation (3). Good credits will typically have large positive scores while bad credits will have large negative scores. Credit applicants with a score above a certain threshold value are granted credit, while the remaining are rejected. For a given cut-off value there are two types of incorrect predictions: the model grants credit to an applicant who will default on the financial obligation (Type I error), or the model denies credit to an applicant who is creditworthy (Type II error or False Alarm Ratio). The cut-off value represents a compromise between a large efficiency for granting credit and a large rejection of bad credits. An excessively lenient policy for granting credit may result in severe economic losses due to delinquent customers, while a credit policy that is too strict may result in opportunity costs larger than the costs of default. The selected cut-off value will ultimately depend on the relative ratio of the misclassification costs associated with Type I and Type II errors (in general, the costs associated with misclassifying bad applicants are financially more damaging than those associated with misclassifying good applicants).
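For a concrete example, the two error rates for a given cut-off can be computed as follows; this is a minimal sketch whose function name is illustrative, and the +1/−1 class coding matches the scores of Equation (3).

```python
import numpy as np

def error_rates(scores, labels, cutoff=0.0):
    """Type I and Type II error rates for a given cut-off on the scores F_i
    (labels coded +1 = good credit, -1 = bad credit)."""
    accept = scores >= cutoff
    type_i = np.mean(accept[labels == -1])     # bad credits that are granted credit
    type_ii = np.mean(~accept[labels == +1])   # creditworthy applicants that are rejected
    return type_i, type_ii
```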

Since the cut-off value depends on the credit policy of the financial institution, it is convenient to express the performance of the models in terms of the receiver operating characteristic (ROC) curve. The ROC curve is a plot of the true positive rate (proportion of bad credits that are correctly classified) as a function of the false positive rate (Type II error) for the full range of possible cut-off values. Figure 2 and Figure 3 show the ROC curves, obtained by merging the 10 cross-validation test sets, for the German and Australian credit datasets, respectively. If a model could completely separate the two populations, it would always give correct predictions. In this case, the ROC curve would pass through the point (0,1) and the area under the ROC curve would be equal to 1. On the other hand, a random-guess classifier would produce as many correct predictions as incorrect ones. In this case, for any cut-off value, the true positive rate would on average equal the false positive rate, and the ROC curve would be a 45-degree straight line intersecting (0,0) and (1,1). A model that performs better than random guessing gives a concave ROC curve above this straight line. The higher the model accuracy, the steeper the ROC curve will be. Therefore, the area under the ROC curve (AUC) is a measure of the generalization accuracy that is independent of the cut-off value.
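A short sketch of the ROC construction and the trapezoidal AUC for the merged out-of-fold scores is given below; it uses scikit-learn and treats “bad credit” as the positive class, consistent with the definition of the true positive rate above.

```python
from sklearn.metrics import auc, roc_curve

def roc_and_auc(scores, labels):
    """ROC curve for merged out-of-fold scores; the positive class is 'bad credit',
    so the true positive rate is the fraction of bad credits correctly flagged."""
    y_bad = (labels == -1).astype(int)        # 1 = bad credit, 0 = good credit
    fpr, tpr, _ = roc_curve(y_bad, -scores)   # lower scores indicate bad credit
    return fpr, tpr, auc(fpr, tpr)            # trapezoidal area under the ROC curve
```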

Table 1 presents the AUC for the three models obtained by trapezoidal integration. For the German dataset the SVM outperformed the MLP, whereas BDT outperformed both the MLP and the SVM. For the Australian dataset a similar ordering of the predictive performance of the three models was observed.

2.4. Comparison of the AUC Estimates

In order to test the statistical significance of the differences between the areas under the ROC curves predicted by the models under consideration, the nonparametric approach introduced in [36] was followed. The AUC can be interpreted as the probability that the score of a randomly selected good credit applicant is higher than that of a randomly selected bad credit applicant. Therefore, denoting by $X_i^{(g)}$, $i = 1, \dots, n_g$, the estimated scores for the good credit set and by $X_j^{(b)}$, $j = 1, \dots, n_b$, the estimated scores for the bad credit set, an unbiased estimator of the AUC is given by the Wilcoxon–Mann–Whitney statistic:

$$\hat{\theta} = \frac{1}{n_b n_g} \sum_{j=1}^{n_b} \sum_{i=1}^{n_g} \mathbf{1}\!\left[ X_i^{(g)} > X_j^{(b)} \right], \tag{4}$$

where the indicator function $\mathbf{1}[X_i^{(g)} > X_j^{(b)}]$ equals 1 if $X_i^{(g)} > X_j^{(b)}$ and 0 otherwise. In order to obtain an estimate of the variance of $\hat{\theta}$, the structural components of the $i$th good credit and the $j$th bad credit must be calculated:

$$v\!\left(X_i^{(g)}\right) = \frac{1}{n_b} \sum_{j=1}^{n_b} \mathbf{1}\!\left[ X_i^{(g)} > X_j^{(b)} \right], \qquad v\!\left(X_j^{(b)}\right) = \frac{1}{n_g} \sum_{i=1}^{n_g} \mathbf{1}\!\left[ X_i^{(g)} > X_j^{(b)} \right]. \tag{5}$$

Then, an estimator for the variance of $\hat{\theta}$ can be obtained from:

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{n_g(n_g-1)} \sum_{i=1}^{n_g} \left[ v\!\left(X_i^{(g)}\right) - \hat{\theta} \right]^2 + \frac{1}{n_b(n_b-1)} \sum_{j=1}^{n_b} \left[ v\!\left(X_j^{(b)}\right) - \hat{\theta} \right]^2. \tag{6}$$

In order to compare the AUCs of two alternative models, $A$ and $B$, the covariance of the corresponding AUC estimators must also be obtained:

$$\widehat{\mathrm{Cov}}(\hat{\theta}_A, \hat{\theta}_B) = \frac{1}{n_g(n_g-1)} \sum_{i=1}^{n_g} \left[ v_A\!\left(X_i^{(g)}\right) - \hat{\theta}_A \right]\!\left[ v_B\!\left(X_i^{(g)}\right) - \hat{\theta}_B \right] + \frac{1}{n_b(n_b-1)} \sum_{j=1}^{n_b} \left[ v_A\!\left(X_j^{(b)}\right) - \hat{\theta}_A \right]\!\left[ v_B\!\left(X_j^{(b)}\right) - \hat{\theta}_B \right]. \tag{7}$$

To test the null hypothesis $H_0: \theta_A = \theta_B$ against the alternative hypothesis $H_1: \theta_A \neq \theta_B$, the following test statistic is computed:

$$T = \frac{\left(\hat{\theta}_A - \hat{\theta}_B\right)^2}{\widehat{\mathrm{Var}}(\hat{\theta}_A - \hat{\theta}_B)}, \qquad \widehat{\mathrm{Var}}(\hat{\theta}_A - \hat{\theta}_B) = \widehat{\mathrm{Var}}(\hat{\theta}_A) + \widehat{\mathrm{Var}}(\hat{\theta}_B) - 2\,\widehat{\mathrm{Cov}}(\hat{\theta}_A, \hat{\theta}_B). \tag{8}$$

The test statistic $T$ is asymptotically $\chi^2$-distributed with one degree of freedom.
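The test described in Equations (4)-(8) can be implemented directly; the sketch below is a straightforward translation of these formulas into Python (NumPy/SciPy), with illustrative function and variable names, and assumes that both models score the same sets of good and bad credit applicants.

```python
import numpy as np
from scipy.stats import chi2

def auc_components(x_g, x_b):
    """WMW estimate of the AUC (Eq. 4) and structural components (Eq. 5) for one model.
    x_g, x_b: NumPy arrays of scores for the good and bad credit sets."""
    ind = (x_g[:, None] > x_b[None, :]).astype(float)   # indicator 1[X_i(g) > X_j(b)]
    theta = ind.mean()                                   # Equation (4)
    v_g = ind.mean(axis=1)                               # v(X_i(g)) for each good credit
    v_b = ind.mean(axis=0)                               # v(X_j(b)) for each bad credit
    return theta, v_g, v_b

def auc_test(xA_g, xA_b, xB_g, xB_b):
    """Chi-squared test statistic (Eq. 8) and p-value for H0: theta_A = theta_B."""
    tA, vA_g, vA_b = auc_components(xA_g, xA_b)
    tB, vB_g, vB_b = auc_components(xB_g, xB_b)
    n_g, n_b = len(xA_g), len(xA_b)
    var_A = ((vA_g - tA) ** 2).sum() / (n_g * (n_g - 1)) \
          + ((vA_b - tA) ** 2).sum() / (n_b * (n_b - 1))          # Equation (6)
    var_B = ((vB_g - tB) ** 2).sum() / (n_g * (n_g - 1)) \
          + ((vB_b - tB) ** 2).sum() / (n_b * (n_b - 1))
    cov = ((vA_g - tA) * (vB_g - tB)).sum() / (n_g * (n_g - 1)) \
        + ((vA_b - tA) * (vB_b - tB)).sum() / (n_b * (n_b - 1))   # Equation (7)
    T = (tA - tB) ** 2 / (var_A + var_B - 2 * cov)                # Equation (8)
    return T, chi2.sf(T, df=1)                                    # statistic and p-value
```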

Table 2 shows the results of applying this test to the estimated ROC curves. For both datasets, one can reject the null hypothesis $H_0: \theta_{\mathrm{BDT}} = \theta_{\mathrm{MLP}}$ at the 5% significance level and, therefore, there is strong evidence that the performance of BDT is better than that of the MLP. For the Australian dataset, there is also strong evidence that BDT outperformed the SVM. However, for the German dataset, the difference between these methods was not statistically significant.

2.5. Relative Importance of the Attributes

Boosted decision trees provide a straightforward and intuitive measure of the relative contribution of the attributes to separate instances according to the target classification. Using this approach, a ranking of the most useful attributes can be established. This ranking is derived by counting the number of times an attribute is employed in the node-splitting procedure and by weighting each split by the separation gain it has accomplished and by the number of instances in the node [29].
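As an illustration, an analogous ranking can be obtained from scikit-learn's impurity-based feature importances, which likewise weight each split by its separation gain and by the number of instances in the node; this is a related measure rather than the exact definition used in [29], and the function below is only a sketch with illustrative names.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def attribute_ranking(X, y, feature_names, n_trees=400, min_split=20):
    """Rank attributes by the impurity-based importance of a fitted boosted-tree model."""
    bdt = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(min_samples_split=min_split),
        n_estimators=n_trees,
    ).fit(X, y)
    order = np.argsort(bdt.feature_importances_)[::-1]     # most important first
    return [(feature_names[i], float(bdt.feature_importances_[i])) for i in order]
```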

Figure 4 shows the relative importance of the attributes for the German credit dataset. The first and fourth attributes are the most important. These attributes correspond to the status of the existing checking accounts and the credit amount, respectively. They are followed by the second attribute (duration of the loan) and the tenth attribute (age of the applicant). Also important is the third attribute, which represents the credit history of the applicant (e.g., whether previous credits were paid punctually or with delays). The fifth to ninth attributes have moderate importance. They correspond to the status of savings accounts, the employment condition, the marital status and sex, the number of years living at the present residence, and the property that the applicant owns, respectively. Figure 5 shows the relative importance of the attributes for the Australian credit dataset. The nature of the attributes in this dataset is unknown. In this dataset, the eighth attribute is clearly the most important. Also of note is that the contributions of attributes 1, 11, and 12 are negligible.

3. Conclusions

This paper introduced a credit scoring model of consumer loans based on boosted decision trees, a learning technique that combines several decision trees to form a classifier obtained from a weighted majority vote of the classifications given by the individual trees. The generalization accuracy of boosted decision trees was compared with that of a multilayer perceptron and support vector machines. Boosted decision trees outperformed the multilayer perceptron and the support vector machines on two real-world credit card application datasets. On the basis of these results, it can be concluded that boosted decision trees may be a competitive alternative to these techniques in credit scoring applications. It was also shown that boosted decision trees provide an elegant way to rank the attributes that most significantly affect the likelihood of default.

Funding

This research was funded by FCT—Fundação para a Ciência e a Tecnologia, under grant number UIDB/05069/2020.