Good or Bad Loans?

Leo Liu
Nov 1, 2018

Updated: Dec 11, 2018

Background

Loans play an important role in generating revenue for the banking industry, and many financial institutions actively market their loan products to clients. However, some clients fail to make their payments on time after the loan is funded. It is critical for financial institutions to identify these risky clients and strengthen their policies against potential losses.

Although financial institutions hold a variety of financial information about their clients, such as employment, annual income, homeownership, and credit scores, there does not seem to be a cost-effective way of screening clients. The tradeoff is this: if the rule is too strict, many good clients will be turned down and the profit from them will be lost; if it is too lenient, default cases can cause large losses. This motivated me to build a machine learning model that helps financial institutions or individuals screen their clients strategically.

Data Source

The dataset I used for my analysis is Kaggle’s Lending Club Loan Data, which records loans issued from 2007 through 2015. It has 75 variables and around 890 thousand observations.

Lending Club is a peer-to-peer lending company that matches borrowers with investors through an online platform. It serves people who need personal loans between $1,000 and $40,000. Borrowers receive the full amount of the issued loan minus the origination fee, which is paid to the company. Investors purchase notes backed by the personal loans and pay Lending Club a service fee. The company shares data about all loans issued through its platform during certain time periods.

Target Labeling and Feature Selection

I chose the variable “loan_status” as my dependent variable (target), along with 30 features covering the client’s employment, annual income, homeownership, credit score level, number of financial inquiries over the last 6 months, collections among accounts other than Lending Club, and the loan’s amount and term. This information was collected before the application was approved.

Out of the 890,000 cases, around 700,000 have a loan status of “Current” or “Issued”, meaning they have not been paid off yet and have not triggered a late event. I excluded these 700,000 cases from my analysis because it is hard to determine whether they are good or bad loans. Of the remaining cases, roughly 200,000, I labeled those with a loan status of “Charged Off”, “Late (1-30 days)”, “Late (31-120 days)”, “Default”, or “In Grace Period” as “bad” loans, and those that have already been fully paid as “good” loans. The percentages of good and bad loans are shown in Figure 1.
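The labeling rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the analysis: the status strings are copied from the text and should be checked against the actual values in the “loan_status” column, the “Fully Paid” spelling is an assumption, and the function name and sample rows are hypothetical.

```python
from collections import Counter

# Status strings as listed in the text; verify them against the dataset.
BAD_STATUSES = {
    "Charged Off",
    "Late (1-30 days)",
    "Late (31-120 days)",
    "Default",
    "In Grace Period",
}
GOOD_STATUSES = {"Fully Paid"}  # assumed spelling of the "fully paid" status


def label_loan(status):
    """Return 'good', 'bad', or None for loans left out of the analysis."""
    if status in BAD_STATUSES:
        return "bad"
    if status in GOOD_STATUSES:
        return "good"
    return None  # "Current", "Issued", and any status not listed above


# Hypothetical sample rows, for illustration only.
sample_statuses = [
    "Current", "Fully Paid", "Charged Off",
    "Issued", "Fully Paid", "In Grace Period",
]

labels = [label_loan(s) for s in sample_statuses]
kept = [label for label in labels if label is not None]
counts = Counter(kept)
shares = {label: n / len(kept) for label, n in counts.items()}
print(counts, shares)
```

Returning None for unresolved loans keeps the exclusion step explicit, so the class shares are computed only over loans whose outcome is known.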

Modeling

My model was built to predict whether a loan will turn out good or bad once it is approved. For computational efficiency, I drew a sample of 10,000 data points to build the model. I used 70% of the data to train the model and tested it on the other 30%. I chose F1-score as the primary metric for model selection and ROC_AUC score as the secondary one. Among all the classifiers I compared (K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Decision Trees, Random Forest, Gradient Boosting, and Naïve Bayes), Gradient Boosting performed best, with a higher F1-score of 0.51 and a slightly better ROC_AUC score of 0.79. The normalized confusion matrix and ROC curve are shown in Figure 2.

Figure 2 - Normalized Confusion Matrix and ROC Curve
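To make the evaluation setup concrete, here is a minimal pure-Python sketch of a 70/30 split and of the F1-score, which is the harmonic mean of precision and recall. It assumes “bad” is treated as the positive class, which the text does not state. The function names and toy labels are hypothetical, and in practice a library such as scikit-learn would handle both steps.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and split it 70% train / 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def f1_score_for(y_true, y_pred, positive="bad"):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train, test = split_70_30(range(10))

# Toy labels, for illustration only (not taken from the analysis).
y_true = ["bad", "bad", "bad", "good", "good", "good", "good", "good"]
y_pred = ["bad", "bad", "good", "bad", "good", "good", "good", "good"]
f1 = f1_score_for(y_true, y_pred)
print(len(train), len(test), round(f1, 3))
```

F1 is a reasonable primary metric here because the two classes are imbalanced, and accuracy alone would reward a model that labels every loan as good.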

To achieve optimal profitability, I conducted a cost-benefit analysis. Based on exploratory data analysis, the cost of turning down a good client is $1,902, and the cost of not detecting a bad client is $8,509. I then adjusted the cutoff to minimize the total cost. The cutoff is a threshold on the predicted probability of being a bad client. For example, a client with a predicted probability of 0.4 of being a bad client is classified as bad when the cutoff is set to 0.3. When the cutoff is set to 0.5, the same client is classified as good. The optimal cutoff for my best model is 0.38.
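The cutoff search described above can be sketched as follows. The two cost figures come from the text. Everything else is an assumption made for illustration: the function names, the toy probabilities, the candidate grid, and the choice to classify a client as bad when the probability is at or above the cutoff.

```python
COST_REJECT_GOOD = 1902  # cost of turning down a good client (from the text)
COST_MISS_BAD = 8509     # cost of not detecting a bad client (from the text)


def total_cost(probs_bad, is_bad, cutoff):
    """Total cost of classifying clients as bad when prob >= cutoff."""
    cost = 0
    for prob, bad in zip(probs_bad, is_bad):
        flagged = prob >= cutoff
        if flagged and not bad:
            cost += COST_REJECT_GOOD  # good client rejected
        elif not flagged and bad:
            cost += COST_MISS_BAD     # bad client accepted
    return cost


def best_cutoff(probs_bad, is_bad, candidates):
    """Return the candidate cutoff with the lowest total cost."""
    return min(candidates, key=lambda c: total_cost(probs_bad, is_bad, c))


# Toy predictions, for illustration only (not taken from the analysis).
probs_bad = [0.08, 0.22, 0.33, 0.41, 0.47, 0.62, 0.81]
is_bad = [False, False, False, True, False, True, True]

candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
cutoff = best_cutoff(probs_bad, is_bad, candidates)
print(cutoff, total_cost(probs_bad, is_bad, cutoff))
```

Because missing a bad client costs several times more than rejecting a good one, the cost-minimizing cutoff tends to fall below 0.5, which is consistent with the 0.38 reported above.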

Conclusion

In conclusion, my model would help the institution save $858 per client, which sums to $24M annually over the time period under study. I also carried out a feature importance analysis to give loan policy makers some insight into ways to mitigate risk and losses. According to the bar chart (see Figure 3), credit score grade and recoveries have a significant impact on the prediction result. Hence, it is reasonable to propose a higher installment or an upper limit on the loan amount for clients with a low credit score or high recoveries from other unpaid bills. This may seem self-evident, but a regression model could be built to tune those policies precisely.

Future Work

Next, I plan to refine the model by adding and engineering more features and by training on larger samples on an AWS machine. I also plan to build a beta application with Flask so that end users can test the model and give feedback.

Studying when, or under what circumstances, clients are more likely to fall behind may be even more valuable than predicting at the application screening phase whether a client will default. I would like to follow up with a time series analysis of the bad cases.