Executive Summary

In this analysis, we present factors related to attrition. Three top factors are presented and relations to other factors are described. Models for attrition and income are presented along with a discussion on performance. Results of the income modeling are interpreted. Finally, some other trends in the data are described.

Introduction

Attrition is costly. It represents a loss of talent and manpower, which often results in schedule delays, lost productivity, and the cost of hiring new talent. In this analysis, we examine a dataset of employee attributes with attrition labels to find attributes associated with attrition and to create a predictive model for attrition. First, the top factors associated with attrition will be presented and explained. Then, a predictive model will be shown and explained. Additionally, a predictive model for income will be presented. The income predictor was a special request made alongside the main analysis. Lastly, an interesting trend found in the data will be reported.

Attrition Analysis

The data science team has identified three top factors that are associated with employee attrition. These factors are addressed individually below. Additionally, the associations between these factors and other variables are discussed.

Top Three Factors

The top three factors associated with attrition are

  • Job Involvement
  • Work-Life Balance
  • Overtime

Job Involvement

Overview of Impact on Attrition

Nearly 50% of employees who rate their job involvement at 1 are leaving the company. The attrition rate at the lowest job involvement level is substantially higher than at the other levels.
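As an illustration of how a per-level attrition rate like this can be computed, here is a minimal base R sketch. The data frame employees and its values are made up for the example; only the column names JobInvolvement and Attrition follow the dataset.

```r
# Hypothetical toy data: column names follow the dataset, values are invented.
employees <- data.frame(
  JobInvolvement = c(1, 1, 2, 2, 3, 3, 3, 4),
  Attrition      = c("Yes", "No", "No", "Yes", "No", "No", "No", "No")
)

# Row-wise proportions: each row (one involvement level) sums to 1.
rates <- prop.table(table(employees$JobInvolvement, employees$Attrition), margin = 1)

# Share of employees leaving at each job involvement level.
attrition_rate <- rates[, "Yes"]
print(round(attrition_rate, 2))
```

Using margin = 1 normalizes within each involvement level, so the rates are comparable even when the levels contain very different numbers of employees.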

Impact on Job Roles

Almost all job roles show high attrition with low job involvement. The roles human resources, research scientist, sales executive, sales representative, manager, healthcare representative, and laboratory technician all show high attrition at the lowest job involvement level. Notably, the attrition rates of sales representatives, manufacturing directors, and research directors do not seem to be affected by job involvement.

Impact on Job Level

Attrition is high for all job levels where job involvement is low. The attrition rate is especially high (over 50%) at levels 1 and 5.

Work-Life Balance

Approximately 35% of employees who rate their work-life balance at 1 are leaving the company. The attrition rate at the lowest work-life balance rating is substantially higher than at the other work-life balance ratings.

Impact on Job Roles

The attrition rate for laboratory technicians and sales executives is especially high for employees reporting low work-life balance, approximately 75% and 60% respectively. The attrition rate for sales representatives looks unusual on this chart because it is unusually high compared to all other roles, which suggests it is driven by something other than work-life balance.

Impact on Job Levels

Plotting the proportion of attrition by work-life balance rating for each job level shows that lower work-life balance ratings generally correlate with higher attrition rates.

Overtime

Approximately 35% of employees performing overtime work are leaving the company. The attrition rate for overtime workers is substantially higher than for employees not working overtime.

Job Role Interaction

The roles human resources, research scientist, sales executive, sales representative, manager, healthcare representative, and laboratory technician all show high attrition rates for overtime workers.

Job Level Interaction

Overtime workers with a job level of 1 show very high attrition rates (above 50%).
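An interaction such as overtime by job level can be tabulated by computing the attrition rate in every combination of the two factors. The following base R sketch uses invented data; the data frame employees and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data: column names follow the dataset, values are invented.
employees <- data.frame(
  OverTime  = c("Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"),
  JobLevel  = c(1, 1, 2, 1, 2, 2, 2, 1),
  Attrition = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "No")
)

# The mean of a logical vector is the proportion of TRUE values, so this
# gives the attrition rate in every OverTime x JobLevel cell.
rate_by_cell <- with(
  employees,
  tapply(Attrition == "Yes", list(OverTime = OverTime, JobLevel = JobLevel), mean)
)
print(rate_by_cell)
```

Cells with few employees give unstable rates, so in practice it helps to inspect the cell counts alongside the proportions.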

Sales Representative Attrition

Sales representatives have a very high attrition rate, nearly 50%. This is much higher than all other roles, which are below 25%.

Sales representatives also generally have the shortest tenure at the company and the fewest total working years on average.

Other Interesting Trend

There appears to be a correlation between years worked at the company and years in current role, years with current manager, and years since last promotion. This correlation appears to be almost linear. However, a linear regression does not meet the required assumptions.

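library(dplyr)      # filter() and the %>% pipe
library(ggplot2)    # ggplot(), geom_point(), geom_smooth()
library(gridExtra)  # grid.arrange()
# Company tenure against years with the current manager, with a linear fit
p1 <- train %>% filter(YearsAtCompany > 0) %>%
  ggplot(aes(x = YearsWithCurrManager, y = YearsAtCompany)) +
  geom_point() + geom_smooth(method = 'lm')
# Company tenure against years in the current role, with a linear fit
p2 <- train %>% filter(YearsAtCompany > 0) %>%
  ggplot(aes(x = YearsInCurrentRole, y = YearsAtCompany)) +
  geom_point() + geom_smooth(method = 'lm')
# Show both panels side by side
grid.arrange(p1, p2, ncol = 2)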
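One possible way to look at both the strength of the linear association and the residual behavior is sketched below in base R. The data are simulated stand-ins (manager tenure is drawn so it cannot exceed company tenure); the variable names and the split at the median are assumptions made for illustration, not the team's procedure.

```r
set.seed(1)

# Simulated stand-in data: manager tenure is between 0 and company tenure.
years_at_company   <- sample(1:30, 200, replace = TRUE)
years_with_manager <- floor(runif(200) * (years_at_company + 1))

# Strength of the linear association (Pearson correlation).
r <- cor(years_with_manager, years_at_company)

# Fit the straight line and compare residual spread for short versus long
# manager tenure. Clearly different spreads hint at non-constant variance.
fit <- lm(years_at_company ~ years_with_manager)
res <- residuals(fit)
short_tenure <- years_with_manager <= median(years_with_manager)
spread <- c(short = sd(res[short_tenure]), long = sd(res[!short_tenure]))

print(round(r, 2))
print(round(spread, 2))
```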

Modeling

The data science team was tasked with modeling two features of the dataset: employee attrition and employee income.

Attrition Modeling

Prediction of attrition was attempted with naive Bayes. The target was to create a model with at least 60% specificity and 60% sensitivity. A large number of variables that appeared to be significant for attrition prediction were included. The model performance metrics are listed below in the top table. The results of the model predictions are listed in the second table.

Based on the performance of the model, the goals for sensitivity and specificity were met.
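To illustrate the idea behind naive Bayes for categorical predictors, here is a small hand-rolled sketch in base R. It is not the team's implementation: the functions nb_fit and nb_predict, the toy data, and the choice of the two predictors are all hypothetical, and a packaged implementation would normally be used instead.

```r
# Hypothetical toy data: the predictors the team actually used are not listed.
toy_train <- data.frame(
  OverTime  = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"),
  JobLevel  = c("1", "1", "2", "3", "1", "2", "1", "2"),
  Attrition = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = TRUE
)

# Fit: class priors plus one conditional probability table per predictor.
nb_fit <- function(data, target, predictors, laplace = 1) {
  y <- data[[target]]
  cond <- lapply(predictors, function(p) {
    counts <- table(y, data[[p]]) + laplace  # Laplace smoothing avoids zeros
    prop.table(counts, margin = 1)           # P(predictor value | class)
  })
  names(cond) <- predictors
  list(prior = prop.table(table(y)), cond = cond)
}

# Predict: sum the log prior and log conditionals, pick the best-scoring class.
nb_predict <- function(fit, newdata) {
  classes <- names(fit$prior)
  scores <- sapply(classes, function(cl) {
    logp <- rep(log(as.numeric(fit$prior[cl])), nrow(newdata))
    for (p in names(fit$cond)) {
      logp <- logp + log(fit$cond[[p]][cl, as.character(newdata[[p]])])
    }
    unname(logp)
  })
  scores <- matrix(scores, ncol = length(classes))  # rows = employees
  classes[apply(scores, 1, which.max)]
}

fit <- nb_fit(toy_train, "Attrition", c("OverTime", "JobLevel"))
new_employees <- data.frame(OverTime = c("Yes", "No"), JobLevel = c("1", "2"))
print(nb_predict(fit, new_employees))  # "Yes" "No"
```

The "naive" part is the assumption that predictors are independent given the class, which is why the conditional probabilities can simply be multiplied (added on the log scale).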

Model Performance

Specificity   Sensitivity
0.6216216     0.8547486

Model Prediction on Test Set

Predicted.Attrition   Missed.Attrition
23                    14
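The reported specificity can be reproduced from the two counts in the prediction table. The sketch below assumes that "no attrition" was treated as the positive class, so that employees who left are the negative class; that assumption is not stated in the report, but it is consistent with the numbers.

```r
# Counts from the prediction table.
predicted_attrition <- 23  # leavers the model flagged
missed_attrition    <- 14  # leavers the model missed

# Assumption: leavers are the negative class, so specificity is the
# share of leavers that the model identified correctly.
specificity <- predicted_attrition / (predicted_attrition + missed_attrition)
print(round(specificity, 7))  # 0.6216216
```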

Income Modeling

Income was modeled with linear regression (OLS). Other models, such as KNN, were attempted, but linear regression provided the best model with interpretable results. A discussion of the model construction and validation is given below.

Model Construction

From EDA, it appears that monthly income is correlated with TotalWorkingYears, Age, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager.

From the factors:

  • JobLevel appears to partition TotalWorkingYears and MonthlyIncome very well.
  • JobRole appears to partition MonthlyIncome very well.

Based on EDA and feature selection, the following model will be used for linear regression:

\[ \mu \lbrace MonthlyIncome \rbrace = \hat\beta_0 + \hat\beta_1 (JobLevel) + \hat\beta_2(JobRole) + \hat\beta_3(TotalWorkingYears) \]
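In R, a model of this form can be fit with lm() by treating JobLevel and JobRole as factors, which expands each into one indicator per non-reference level. The sketch below uses simulated data: the data frame sim, the three roles, and the numbers used to generate income are assumptions for illustration only.

```r
set.seed(42)
n <- 300

# Simulated stand-in data: column names follow the dataset, values are invented.
sim <- data.frame(
  JobLevel = factor(sample(1:5, n, replace = TRUE)),
  JobRole  = factor(sample(
    c("Healthcare Representative", "Manager", "Research Scientist"),
    n, replace = TRUE
  )),
  TotalWorkingYears = sample(0:40, n, replace = TRUE)
)
sim$MonthlyIncome <- 3000 + 2000 * as.integer(sim$JobLevel) +
  50 * sim$TotalWorkingYears + rnorm(n, sd = 1000)

# Additive model: factor levels shift the intercept, with one common slope
# for TotalWorkingYears.
fit <- lm(MonthlyIncome ~ JobLevel + JobRole + TotalWorkingYears, data = sim)
print(round(coef(fit), 2))
```

The first level of each factor (job level 1 and, alphabetically, Healthcare Representative) becomes the reference category absorbed into the intercept.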

Model Assessment

Assessment Plots

The model was trained with the entire dataset. The trained model was used to generate assessment plots.

  • The QQ plot and histogram of residuals do not appear to provide evidence against the assumption of normality.
  • While there may be some hints against the assumption of constant variance, the violations do not appear to be egregious as around 5% of studentized residuals are expected to be outside \(\pm 2\).
  • The sampling procedure is not known, so independence cannot be assessed. We will assume it holds and continue with caution.
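The 5% rule of thumb for studentized residuals can be checked numerically. The sketch below fits a simple regression to simulated data (the variables x and y and their generating values are assumptions) and reports the share of studentized residuals outside ±2. Base R's qqnorm() and qqline() could be applied to the same residuals for the normality check.

```r
set.seed(7)

# Simulated stand-in data with normal, constant-variance errors.
x <- runif(500, 0, 40)
y <- 3500 + 45 * x + rnorm(500, sd = 1000)
fit <- lm(y ~ x)

# Externally studentized residuals from the fitted model.
stud <- rstudent(fit)

# Share of residuals beyond +/- 2; roughly 5% is expected when the
# normality and constant-variance assumptions hold.
share_outside <- mean(abs(stud) > 2)
print(round(share_outside, 3))
```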

Cross Validation and External Validation

The model was validated in two ways: cross validation and external validation. Cross validation served as internal validation, an initial screening method for tentative models. The cross validation RMSE value for this model is shown in the table under RMSE.Train. Once well-performing models were chosen, the final model was selected by running the models on an external validation set (unseen data). The external validation RMSE score is shown in the table below as RMSE.Test. An estimate of the variation in income explained by this model is also given in the table as Adj.R.Square.

RMSE.Test   RMSE.Train   Adj.R.Square
981.5453    1007.498     0.9525286
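A cross validation RMSE of this kind can be obtained by averaging the held-out RMSE over k folds. The sketch below is a minimal base R version on simulated data. The number of folds (k = 5), the helper rmse, and the single-predictor model are assumptions, since the report does not state how the folds were set up.

```r
set.seed(123)
n <- 200

# Simulated stand-in data: one numeric predictor and a noisy income.
sim <- data.frame(TotalWorkingYears = sample(0:40, n, replace = TRUE))
sim$MonthlyIncome <- 3500 + 45 * sim$TotalWorkingYears + rnorm(n, sd = 1000)

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

# Train on k - 1 folds and score on the held-out fold, once per fold.
fold_rmse <- sapply(1:k, function(i) {
  fit <- lm(MonthlyIncome ~ TotalWorkingYears, data = sim[fold != i, , drop = FALSE])
  held_out <- sim[fold == i, , drop = FALSE]
  rmse(held_out$MonthlyIncome, predict(fit, newdata = held_out))
})

cv_rmse <- mean(fold_rmse)
print(round(cv_rmse, 1))
```

The same rmse helper can be applied to predictions on an external validation set to obtain the counterpart of RMSE.Test.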

Model Interpretation

Because no interaction terms were included, the model assumes the same slope between MonthlyIncome and TotalWorkingYears for every level of the categorical variables. The categorical variables only shift the intercept of the regression between MonthlyIncome and TotalWorkingYears.

The estimates and p-values from the model fit are shown below. We find that an increase in total working years of one year is associated with an increase in mean monthly income of $44.46. The change in intercept for each job level appears to be significantly different (level 1 was used for reference). The change in intercept for each job role appears to be significantly different except for manufacturing director and sales executive. There is not sufficient evidence to suggest that the intercepts for manufacturing director and sales executive are significantly different from the reference (Healthcare Representative).

Variable                           Estimate    p-value
(Intercept)                         3561.09    < 2e-16
Total Working Years                   44.46    8.04e-07
Job Level 2                         1742.46    < 2e-16
Job Level 3                         4893.21    < 2e-16
Job Level 4                         8191.81    < 2e-16
Job Level 5                        10960.61    < 2e-16
Job Role: Human Resources           -984.80    0.00084
Job Role: Laboratory Technician    -1163.75    1.75e-08
Job Role: Manager                   3436.58    < 2e-16
Job Role: Manufacturing Director     153.28    0.40279
Job Role: Research Director         3562.03    < 2e-16
Job Role: Research Scientist        -981.12    2.40e-06
Job Role: Sales Executive            -40.24    0.79983
Job Role: Sales Representative     -1220.67    1.71e-06
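As a worked example of reading the table, the estimated mean monthly income for a hypothetical employee can be assembled from the coefficients. The profile below (job level 2, research scientist, 10 total working years) is chosen purely for illustration; the coefficient values are copied from the table.

```r
# Estimates copied from the coefficient table.
coefs <- c(
  intercept               = 3561.09,
  total_working_years     = 44.46,
  job_level_2             = 1742.46,
  role_research_scientist = -981.12
)

# Hypothetical employee: job level 2, research scientist, 10 working years.
years <- 10

# Intercept + job level shift + job role shift + slope * years.
income <- coefs[["intercept"]] + coefs[["job_level_2"]] +
  coefs[["role_research_scientist"]] + coefs[["total_working_years"]] * years
print(income)  # 4767.03
```

An employee in the reference categories (job level 1, Healthcare Representative) would use only the intercept and the slope term.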

Appendix

Session Info

The session info output for this file is provided in the codebook (./CodeBook.md). The session info was generated by calling sessionInfo() on completion of knitting this Rmd file.