Modeling Attrition
Stuart Miller August 14, 2019
## Computation Setup
# import libraries
library(knitr)
library(tidyverse)
library(GGally)
library(gridExtra)
library(RColorBrewer)
library(gplots)
library(corrplot)
library(ggthemes)
library(caret)
set.seed(3456)
# import helper functions
source('../helper/data_munging.R')
source('../helper/visual.R')
# read in data
train <- read_csv('../data/CaseStudy2-data_train.csv')
# create a vector of numeric features
features.numeric <- c('DailyRate', 'DistanceFromHome', 'Age', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate',
'NumCompaniesWorked','PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear',
'YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion', 'YearsWithCurrManager')
# create a vector of numeric features
features.factor <- c('BusinessTravel', 'Department', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance')
Variable Munging
Based on EDA several categorical levels will be releveled. Levels of variables that appear to have similar rates of attrition will be combined.
Groupings
- JobInvolvement
- Level 3 is similar to level 2
- Level 4 is similar to level 2
- EnvironmentSatisfaction
- Level 3 is similar to level 2
- Level 4 is similar to level 2
- WorkLifeBalance
- Level 3 is similar to level 2
- Level 4 is similar to level 2
- StockOptionLevel
- Level 0 is similar to level 3
- Level 1 is similar to level 2
- Education
- Level 2 is similar to level 1
- Level 3 is similar to level 1
- Level 4 is similar to level 1
- JobLevel
- Level 2 is similar to level 5
- Level 3 is similar to level 5
- JobRole
- Sales Executive, Human Resources, Laboratory Technician, and Research Scientist are similar
- Manufacturing Director, Research Director, Manager, and Healthcare Representative are similar
- BusinessTravel
- Non-Travel and Travel_Rarely are similar
- EducationField
- Technical Degree and Human Resources are similar
- Life Sciences and Medical are similar
- Marketing and Other are similar
- JobSatisfaction
- Level 3 is similar to level 1
- Level 4 is similar to level 1
# drop variables that are not useful - static or unique
train <- select(train, -c('EmployeeCount', 'EmployeeNumber','StandardHours', 'Over18'))
# factor categorical variables
train[, features.factor] <- lapply(train[, features.factor], as.factor)
train$Attrition <- as.factor(train$Attrition)
train$JobInvolvement <- recode(train$JobInvolvement, '3' = '2', '4' = '2')
train$EnvironmentSatisfaction <- recode(train$EnvironmentSatisfaction, '3' = '2', '4' = '2')
train$WorkLifeBalance <- recode(train$WorkLifeBalance, '3' = '2', '4' = '2')
train$StockOptionLevel <- recode(train$StockOptionLevel, '0' = '3', '1' = '2')
train$Education <- recode(train$Education, '2' = '1', '3' = '1', '4' = '1')
train$JobLevel <- recode(train$JobLevel, '2' = '5', '3' = '5')
train$JobRole <- as.factor(recode(train$JobRole, "Sales Representative" = 3,
"Research Scientist" = 2,
"Sales Executive" = 2,
"Human Resources" = 2,
"Laboratory Technician" = 2,
"Manufacturing Director" = 1,
"Research Director" = 1,
"Manager" = 1,
"Healthcare Representative" = 1))
train$BusinessTravel <- recode(train$BusinessTravel, "Non-Travel" = "Travel_Rarely" )
train$EducationField <- recode(train$EducationField,
"Technical Degree" = '1',
"Human Resources" = '1',
"Life Sciences" = '2',
"Marketing" = '3',
"Medical" = '2',
"Other" = '3')
train$JobSatisfaction <- recode(train$JobSatisfaction,
'3' = '1',
'2' = '1')
Data Partitioning
The training dataset will be split into a training set and a test set.
The training set dfTrain
will be used for model cross validation. The
test set, dfTest
will be used for final model selection.
# split off a test set
trainIndex <- createDataPartition(train$Attrition , p = .75,
list = FALSE,
times = 1)
dfTrain <- train[trainIndex,]
dfTest <- train[-trainIndex,]
Model Training
With a small data set for training, the model will be initially trained with repeated 5-fold CV. Because there are a large number of categorical that appear to be useful for predicting attrition (based on EDA), a naive bayes model will be used.
# Set up repeated k-fold cross-validation
train.control <-trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE)
# Train the model
model.cv <-train(y = dfTrain$Attrition,
x = dfTrain[, c('OverTime' ,
'JobInvolvement' ,
'EnvironmentSatisfaction',
'WorkLifeBalance' ,
'StockOptionLevel' ,
'Age',
'YearsInCurrentRole',
'YearsAtCompany',
'JobLevel',
'JobRole',
'TotalWorkingYears',
'MonthlyIncome',
'DistanceFromHome',
'YearsWithCurrManager',
'BusinessTravel',
'EducationField')],
method = 'nb',
metric = 'Spec',
trControl = train.control)
# print model summary
model.cv
## Naive Bayes
##
## 653 samples
## 16 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 522, 522, 522, 523, 523, 522, ...
## Resampling results across tuning parameters:
##
## usekernel ROC Sens Spec
## FALSE 0.7868869 0.8606038 0.5657143
## TRUE 0.7802677 0.9032761 0.4952381
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Spec was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
## and adjust = 1.
Model Validation
A previously unseen dataset will be used to verify that model is not overfit to the training data.
The current model provides a specificity of 0.66 on the validation data.
preds <- predict(model.cv, dfTest[, !names(dfTest) %in% ('Attrition')])
confusionMatrix(table(preds, dfTest$Attrition))
## Confusion Matrix and Statistics
##
##
## preds No Yes
## No 157 12
## Yes 25 23
##
## Accuracy : 0.8295
## 95% CI : (0.7727, 0.877)
## No Information Rate : 0.8387
## P-Value [Acc > NIR] : 0.68356
##
## Kappa : 0.452
## Mcnemar's Test P-Value : 0.04852
##
## Sensitivity : 0.8626
## Specificity : 0.6571
## Pos Pred Value : 0.9290
## Neg Pred Value : 0.4792
## Prevalence : 0.8387
## Detection Rate : 0.7235
## Detection Prevalence : 0.7788
## Balanced Accuracy : 0.7599
##
## 'Positive' Class : No
##