Home Credit Loan Default Analysis
Image by Nattanan Kanchanaprat from Pixabay
Created by Stuart Miller, Paul Adams, Justin Howard, and Mel Schwan.
Project Summary
This project is an analysis of lending data from Home Credit. The primary goal was to predict loan defaults. As a secondary goal, we segmented customers and interpreted the resulting clusters.
Analysis
This project is broken into three main sections:
- Data Exploration and Understanding
- Predictive Modeling
- Clustering Analysis and Segmentation
Data Exploration
Explored the data to understand the types of features and the relationships between them.
- Provided detailed description of the features and assessed data types.
- Assessed missing values and removed features with excessive missing or corrupted data.
- Reported interesting features from univariate and multivariate analyses.
- Engineered new features from understanding of given features.
Report: MiniLab1 CRISP-DM: Notebook for data exploration phase.
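The missing-value assessment described above can be sketched with pandas. This is a minimal illustration, not the project's notebook code: the function name `drop_sparse_features`, the 0.5 threshold, and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_sparse_features(df: pd.DataFrame, max_missing: float = 0.5):
    """Drop columns whose fraction of missing values exceeds max_missing.

    max_missing is a hypothetical threshold chosen for illustration;
    the project's actual cutoff is not stated in this README.
    """
    # Fraction of missing entries per column, largest first.
    missing_fraction = df.isna().mean().sort_values(ascending=False)
    to_drop = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=to_drop), missing_fraction


# Toy frame with made-up column names and values.
toy = pd.DataFrame(
    {
        "income": [50_000, 62_000, np.nan, 48_000],
        "car_age": [np.nan, np.nan, np.nan, 7.0],
        "credit": [200_000, 150_000, 310_000, 90_000],
    }
)

reduced, missing_fraction = drop_sparse_features(toy, max_missing=0.5)
print(missing_fraction)
print(list(reduced.columns))
```

Reporting the per-column missing fraction alongside the reduced frame keeps the decision auditable, since the threshold is a judgment call.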
Predictive Modeling
- Assessed importance of features with logistic regression and random forests.
- Created three models for predicting loan defaults.
- Tuned model hyperparameters to improve performance.
- Assessed model performances and selected the best model for the application.
- Created and tuned three models for a regression task (secondary objective).
- Provided a model deployment plan.
Report: MiniLab2 CRISP-DM: Notebook for predictive model development.
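As a rough sketch of the feature-importance and tuning steps listed above, the example below ranks features with a logistic regression and a random forest and tunes the regularization strength by cross-validated AUC. The data is synthetic (`make_classification`), and the grid values, sample sizes, and model settings are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (not the Home Credit set).
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune the logistic regression's C by 5-fold cross-validated AUC.
# The grid values are illustrative assumptions.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# On standardized inputs, coefficient magnitude is a simple importance proxy.
coef = search.best_estimator_.named_steps["logisticregression"].coef_.ravel()
logit_rank = np.argsort(-np.abs(coef))

# Impurity-based importances from a random forest, for comparison.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
forest_rank = np.argsort(-forest.feature_importances_)

print("best params:", search.best_params_)
print("top logistic features:", logit_rank[:5])
print("top forest features:", forest_rank[:5])
```

Comparing the two rankings is useful because they measure different things: the logistic coefficients reflect linear effects, while the forest importances can also pick up nonlinear ones.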
Clustering Analysis and Segmentation
- Evaluated several methods for dimensionality reduction.
- Reduced the 300+ dimensions to 2 dimensions with an autoencoder.
- Clustered on the reduced dimension set.
- Provided an interpretation of the clusters.
Report: Lab3 CRISP-DM: Notebook for clustering and segmentation.
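The reduce-then-cluster idea above can be sketched as follows. The project used a Keras autoencoder; to keep this example small, scikit-learn's `MLPRegressor` trained to reconstruct its own input stands in for it, which is a substitution and not the project's architecture. The data is synthetic, and the layer sizes, the `encode` helper, and the choice of k-means are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 30 features instead of the project's 300+.
X, _ = make_blobs(n_samples=600, n_features=30, centers=8, random_state=0)
X = StandardScaler().fit_transform(X)

# A small autoencoder: the network learns to reproduce X through a
# 2-unit bottleneck. Layer sizes are illustrative assumptions.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(16, 2, 16),
    activation="tanh",
    max_iter=500,
    random_state=0,
)
autoencoder.fit(X, X)


def encode(model, data):
    """Run the first two layers by hand to get the 2-D bottleneck codes."""
    activation = data
    for weights, bias in zip(model.coefs_[:2], model.intercepts_[:2]):
        activation = np.tanh(activation @ weights + bias)
    return activation


codes = encode(autoencoder, X)

# Cluster in the reduced 2-D space (k-means chosen here for simplicity).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(codes)
print(codes.shape, np.bincount(labels))
```

Clustering in two dimensions also makes the result easy to plot, which helps when interpreting the clusters.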
Main Conclusions
Predictive Modeling
- We were able to create a list of important features.
- We developed a logistic regression model with an AUC of 0.76. This model did not show evidence of overfitting.
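One common way to look for overfitting is to compare AUC on the training data with AUC on held-out data; a large gap suggests the model has memorized the training set. The sketch below shows that check on synthetic data. It is an assumption about how such a check might be done, not a description of the project's procedure, and its numbers are unrelated to the 0.76 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_gap = train_auc - test_auc

print(f"train AUC: {train_auc:.3f}, test AUC: {test_auc:.3f}, gap: {auc_gap:.3f}")
```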
Clustering and Segmentation
- We were able to create eight well-defined clusters that can be easily interpreted.
- We were also able to develop a model to classify new customers into the clusters with reasonable AUC.
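Assigning new customers to existing clusters can be framed as a multiclass classification problem with the cluster labels as targets. The sketch below illustrates that framing on synthetic data. The random forest classifier and the one-vs-rest AUC are assumptions for the example; the README does not state which classifier or AUC variant the project used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with eight groups.
X, _ = make_blobs(n_samples=1600, n_features=10, centers=8, random_state=0)

# Cluster labels become the classification targets.
cluster_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, stratify=cluster_labels, random_state=0
)

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)

# One-vs-rest AUC, averaged over the clusters.
proba = classifier.predict_proba(X_test)
auc = roc_auc_score(y_test, proba, multi_class="ovr")
print(f"one-vs-rest AUC: {auc:.3f}")
```

A classifier trained this way can label a new customer without re-running the clustering, provided the new data goes through the same preprocessing.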
Additional Information
- The project repository can be found at https://github.com/sjmiller8182/home-credit-default-risk.
- This analysis was conducted in Python using NumPy, pandas, scikit-learn, and Keras.