An Analysis of Loan Default Risk

Created by Stuart Miller, Paul Adams, Justin Howard, and Mel Schwan.

This analysis of Home Credit’s Default Risk dataset will focus on generating accurate loan default risk probabilities. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. The target variable of the dataset is the binary label, TARGET, indicating whether the loan entered into default status (1) or not (0).

The final model will produce the probability of default for each loan, and the predicted probabilities will be evaluated on the area under the ROC curve (AUC). We believe that a good predictive model is capable of achieving an AUC between 0.70 and 0.80.
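As a rough illustration of the evaluation metric, the sketch below computes the area under the ROC curve from predicted default probabilities using the rank-sum formulation, with tied scores given their average rank. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; they are not taken from the project notebooks, which may rely on a library routine instead.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 values (1 = default), scores: predicted probabilities.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example (made-up values): 3 of the 6 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.5
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of defaults above non-defaults, so the metric depends only on the ordering of the predicted probabilities and not on a classification threshold.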

Data

The data was provided by Home Credit and is hosted by Kaggle.

The dataset consists of 307,511 individual loans. For the purpose of this assignment, the analysis will be limited to the initial training and test sets, with the addition of three engineered features derived from bureau.csv.
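The text does not say which three features were engineered, so the following is only a sketch of the general pattern: collapse the bureau records to one row per applicant, then left-join the result onto the application rows. The column names (`SK_ID_CURR`, `CREDIT_ACTIVE`, `AMT_CREDIT_SUM`) are assumed from the public Kaggle schema, while the feature names (`n_loans`, `n_active`, `total_credit`), the helper `aggregate_bureau`, and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for bureau.csv; the values are made up for illustration.
BUREAU_SAMPLE = """SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,AMT_CREDIT_SUM
100001,1,Active,1000.0
100001,2,Closed,500.0
100002,3,Active,2000.0
"""


def empty_features():
    """Feature values for an applicant with no bureau history."""
    return {"n_loans": 0, "n_active": 0, "total_credit": 0.0}


def aggregate_bureau(rows):
    """Collapse bureau records to one dict of features per applicant ID."""
    features = defaultdict(empty_features)
    for row in rows:
        f = features[row["SK_ID_CURR"]]
        f["n_loans"] += 1
        f["n_active"] += int(row["CREDIT_ACTIVE"] == "Active")
        # Treat a blank amount as zero rather than failing on float("").
        f["total_credit"] += float(row["AMT_CREDIT_SUM"] or 0.0)
    return dict(features)


bureau_features = aggregate_bureau(csv.DictReader(io.StringIO(BUREAU_SAMPLE)))

# Left join: every application row is kept, and applicants without
# bureau records receive the zero-valued defaults.
applications = [
    {"SK_ID_CURR": "100001"},
    {"SK_ID_CURR": "100002"},
    {"SK_ID_CURR": "100003"},
]
merged = [
    {**app, **bureau_features.get(app["SK_ID_CURR"], empty_features())}
    for app in applications
]
```

Joining from the application side keeps the row count of the training and test sets unchanged, which matters because each row there corresponds to one loan to be scored.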

Data Table                   Number of Features
application_{train}.csv      122
application_{test}.csv       121
newFeatures.csv              3
bureau.csv                   17
bureau_balance.csv           3
POS_CASH_balance.csv         8
installments_payments.csv    8
credit_card_balance.csv      23
previous_application.csv     37

Analysis

This project is broken into three main sections:

  1. Data Exploration and Understanding
  2. Predictive Modeling
  3. Clustering Analysis and Segmentation

Data Exploration

We explored the data to understand the types of features and the relationships between them.

Report: MiniLab1 CRISP-DM: Notebook for data exploration phase.

Predictive Modeling

Report: MiniLab2 CRISP-DM: Notebook for predictive model development.

Clustering Analysis and Segmentation

Report: Lab3 CRISP-DM: Notebook for clustering and segmentation.

Main Conclusions

Predictive Modeling

Clustering and Segmentation