In this analysis, we present factors related to attrition. Three top factors are presented and relations to other factors are described. Models for attrition and income are presented along with a discussion on performance. Results of the income modeling are interpreted. Finally, some other trends in the data are described.
Attrition is costly. It represents a loss of talent and manpower, which often results in schedule delays, lost productivity, and the cost of hiring new talent. In this analysis, we examine a dataset of employee attributes with attrition labels to find attributes associated with attrition and to create a predictive model for attrition. First, the top factors associated with attrition will be presented and explained. Then, a predictive model will be shown and explained. Additionally, a predictive model for income will be presented. The income predictor was a special request made alongside the main analysis. Lastly, some interesting trends found in the data will be reported.
The data science team has identified three top factors that are associated with employee attrition. These factors are addressed individually below. Additionally, the associations between these factors and other variables are discussed.
The top three factors associated with attrition are job involvement, work-life balance, and overtime.
Nearly 50% of employees rating their job involvement at 1 are leaving the company. The attrition rate at the lowest job involvement level is substantially higher than at the other job involvement levels.
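A rate like this is the share of employees who left within each rating level. The sketch below uses base R on a small made-up data frame. The column names `JobInvolvement` and `Attrition`, the "Yes"/"No" coding, and the helper name `attrition_rate_by` are assumptions for illustration, and the toy values are not the study data.

```r
# Hypothetical helper: proportion of employees who left, per level of a grouping column.
# Assumes `Attrition` is coded "Yes"/"No" (an assumption, not confirmed by the report).
attrition_rate_by <- function(df, group_col) {
  tapply(df$Attrition == "Yes", df[[group_col]], mean)
}

# Toy data for illustration only (not the study data)
toy <- data.frame(
  JobInvolvement = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
  Attrition      = c("Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No")
)

rates <- attrition_rate_by(toy, "JobInvolvement")
print(round(rates, 2))
```

The same helper could be pointed at any other grouping column, such as a work-life balance or overtime column, to reproduce the comparisons in the following sections.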
Almost all job roles show high attrition with low job involvement. The roles human resources, research scientist, sales executive, sales representative, manager, healthcare representative, and laboratory technician show high attrition for the lowest job involvement. Notably, attrition rates of sales representatives, manufacturing directors, and research directors do not seem to be affected by job involvement.
Attrition is high for all job levels where job involvement is low. The attrition rate is especially high (over 50%) at job levels 1 and 5.
Approximately 35% of employees rating their work-life balance at 1 are leaving the company. The attrition rate at the lowest work-life balance level is substantially higher than at the other work-life balance levels.
The attrition rate for laboratory technicians and sales executives is especially high for employees reporting low work-life balance, approximately 75% and 60% respectively. The attrition rate for sales representatives looks unusual on this chart. This is because it is unusually high compared to all other roles and is likely driven by something else.
Plotting the proportion of attrition by work-life balance rating for each job level shows that lower ratings of work-life balance generally correlate with higher attrition rates.
Approximately 35% of employees performing overtime work are leaving the company. The attrition rate for overtime workers is substantially higher than for employees not working overtime.
The roles human resources, research scientist, sales executive, sales representative, manager, healthcare representative, and laboratory technician all show high attrition rates for overtime workers.
Overtime workers with a job level of 1 show very high attrition rates (above 50%).
Sales representatives have a very high attrition rate, nearly 50%. This is much higher than all other roles, which are below 25%.
Sales representatives also generally have the shortest tenure at the company and the fewest total working years on average.
There appears to be a correlation between years worked at the company and years in current role, years with current manager, and years since last promotion. This correlation appears to be almost linear. However, a linear regression does not meet the required assumptions.
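
    library(dplyr)      # filter() and the %>% pipe
    library(ggplot2)    # ggplot(), geom_point(), geom_smooth()
    library(gridExtra)  # grid.arrange()
    # Years at company vs. years with current manager, with a linear fit
    p1 <- train %>% filter(YearsAtCompany > 0) %>%
      ggplot(aes(x = YearsWithCurrManager, y = YearsAtCompany)) +
      geom_point() + geom_smooth(method = 'lm')
    # Years at company vs. years in current role, with a linear fit
    p2 <- train %>% filter(YearsAtCompany > 0) %>%
      ggplot(aes(x = YearsInCurrentRole, y = YearsAtCompany)) +
      geom_point() + geom_smooth(method = 'lm')
    grid.arrange(p1, p2, ncol = 2)  # show both panels side by side
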
The data science team was tasked with modeling two features of the dataset: employee attrition and employee income.
Prediction of attrition was attempted with naive Bayes. The target was to create a model with at least 60% specificity and 60% sensitivity. A large number of variables that appeared to be significant for attrition prediction were included. The model performance metrics are listed below in the first table. The results of the model predictions are listed in the second table.
Based on the performance of the model, the goals for sensitivity and specificity were met.
Model Performance
Specificity | Sensitivity |
---|---|
0.6216216 | 0.8547486 |
Model Prediction on Test Set
Predicted.Attrition | Missed.Attrition |
---|---|
23 | 14 |
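As a consistency check, the reported specificity matches the counts in the table above if employees who left are treated as the negative class. That class assignment is an inference from the numbers, not something the report states. Under that reading, 23 of the 23 + 14 = 37 leavers in the test set were identified:

```r
predicted_attrition <- 23  # leavers the model flagged (from the table above)
missed_attrition    <- 14  # leavers the model missed (from the table above)

# Share of actual leavers that were correctly identified
specificity <- predicted_attrition / (predicted_attrition + missed_attrition)
print(round(specificity, 7))  # 0.6216216
```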
Income was modeled with linear regression (OLS). Other models, such as KNN, were attempted, but linear regression provided the best model with interpretable results. A discussion of the model construction and validation is given below.
From EDA, it appears that monthly income is correlated with `TotalWorkingYears`, `Age`, `YearsAtCompany`, `YearsInCurrentRole`, and `YearsWithCurrManager`.
From the factors:

- `JobLevel` appears to partition `TotalWorkingYears` and `MonthlyIncome` very well.
- `JobRole` appears to partition `MonthlyIncome` very well.

Based on EDA and feature selection, the following model will be used for linear regression:
\[ \mu \lbrace MonthlyIncome \rbrace = \hat\beta_0 + \hat\beta_1 (JobLevel) + \hat\beta_2(JobRole) + \hat\beta_3(TotalWorkingYears) \]
The model was trained with the entire dataset. The trained model was used to generate assessment plots.
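A fit of this form might look like the sketch below. It runs on simulated stand-in data, not the study data, and the simulated coefficients are arbitrary. Wrapping `JobLevel` in `factor()` is an assumption, chosen because the report gives a separate estimate for each job level.

```r
# Simulated stand-in for the training data (illustration only, not the study data)
set.seed(1)
n <- 300
toy <- data.frame(
  JobLevel          = sample(1:3, n, replace = TRUE),
  JobRole           = sample(c("Healthcare Representative", "Manager", "Research Scientist"),
                             n, replace = TRUE),
  TotalWorkingYears = round(runif(n, 0, 40))
)
toy$MonthlyIncome <- 3500 + 2000 * (toy$JobLevel - 1) +
  45 * toy$TotalWorkingYears + rnorm(n, sd = 1000)

# factor(JobLevel) gives each job level its own intercept shift;
# JobRole is a character column, so lm() treats it as a factor automatically.
fit <- lm(MonthlyIncome ~ factor(JobLevel) + JobRole + TotalWorkingYears, data = toy)
print(round(coef(fit), 2))
```

Calling `plot(fit)` on the fitted object draws the standard `lm` diagnostic plots, which is one way assessment plots like those mentioned above could be produced.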
The model was validated in two ways: cross-validation and external validation. Cross-validation (internal validation) was used as an initial screening method on tentative models. The cross-validation RMSE for this model is shown in the table under `RMSE.Train`. Once well-performing models were chosen, the final model was selected by running the models on an external validation set (unseen data). The external validation RMSE is shown in the table below as `RMSE.Test`. An estimate of the variation in income explained by this model is also given in the table as `Adj.R.Square`.
RMSE.Test | RMSE.Train | Adj.R.Square |
---|---|---|
981.5453 | 1007.498 | 0.9525286 |
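For reference, a cross-validated RMSE of this kind can be sketched in base R as follows. The helper names `rmse` and `cv_rmse`, the choice of 5 folds, and the simulated data are all assumptions for illustration. The report does not state the number of folds or the tooling used.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical helper: k-fold cross-validated RMSE for a linear model.
# k = 5 is an assumption; the report does not state the number of folds.
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]                          # name of the response column
  scores <- vapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[fold != i, , drop = FALSE])
    held_out <- data[fold == i, , drop = FALSE]
    rmse(held_out[[response]], predict(fit, newdata = held_out))
  }, numeric(1))
  mean(scores)  # average RMSE over the held-out folds
}

# Simulated data for illustration only (not the study data)
set.seed(42)
toy <- data.frame(TotalWorkingYears = runif(200, 0, 40))
toy$MonthlyIncome <- 3000 + 45 * toy$TotalWorkingYears + rnorm(200, sd = 1000)

cv_score <- cv_rmse(MonthlyIncome ~ TotalWorkingYears, toy)
print(cv_score)
```

The external validation score would use the same `rmse` function, applied once to predictions on the held-out validation set.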
Because no interaction terms were included, the model assumes the same slope between `MonthlyIncome` and `TotalWorkingYears` for every level of the categorical variables. The categorical variables only shift the intercept of the regression between `MonthlyIncome` and `TotalWorkingYears`.
The estimates and p-values from the model fit are shown below. We find that an increase of one year in total working years is associated with an increase in mean monthly income of $44.46. The change in intercept for each job level appears to be significantly different (level 1 was used for reference). The change in intercept for each job role appears to be significantly different except for manufacturing director and sales executive. There is not sufficient evidence to suggest that the intercepts for manufacturing director and sales executive are significantly different from the reference (Healthcare Representative).
Variable | Estimate | p-value |
---|---|---|
(Intercept) | 3561.09 | < 2e-16 |
Total Working Years | 44.46 | 8.04e-07 |
Job Level 2 | 1742.46 | < 2e-16 |
Job Level 3 | 4893.21 | < 2e-16 |
Job Level 4 | 8191.81 | < 2e-16 |
Job Level 5 | 10960.61 | < 2e-16 |
Job Role: Human Resources | -984.80 | 0.00084 |
Job Role: Laboratory Technician | -1163.75 | 1.75e-08 |
Job Role: Manager | 3436.58 | < 2e-16 |
Job Role: Manufacturing Director | 153.28 | 0.40279 |
Job Role: Research Director | 3562.03 | < 2e-16 |
Job Role: Research Scientist | -981.12 | 2.40e-06 |
Job Role: Sales Executive | -40.24 | 0.79983 |
Job Role: Sales Representative | -1220.67 | 1.71e-06 |
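To illustrate how the estimates combine, the sketch below computes the predicted mean monthly income for a hypothetical employee at job level 2, in the Research Scientist role, with 10 total working years. The employee profile and the variable names are made up for illustration. The coefficients are copied from the table above.

```r
# Coefficient estimates copied from the table above
coefs <- c(
  intercept           = 3561.09,
  total_working_years = 44.46,
  job_level_2         = 1742.46,
  role_research_sci   = -981.12
)

# Hypothetical employee: job level 2, Research Scientist, 10 total working years
years <- 10
predicted_income <- coefs[["intercept"]] +
  coefs[["job_level_2"]] +
  coefs[["role_research_sci"]] +
  coefs[["total_working_years"]] * years
print(predicted_income)  # 4767.03
```

An employee at the reference levels (job level 1, Healthcare Representative) would receive only the intercept plus the `TotalWorkingYears` term.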
The session info output for this file is provided in the codebook (`./CodeBook.md`). The session info was generated by calling `sessionInfo()` on completion of knitting this `Rmd` file.