9.1 BONUS: Attrition (decision tree classifier)

Let’s apply a decision tree classifier to the attrition dataset.

suppressMessages(library(tidymodels))
suppressMessages(library(tidyverse))
library(themis)

Load dataset

# load dataset
attrition <- modeldata::attrition

# clean names with the `janitor` package and move the target to the first column
# (the coercion of ordered factors to numeric is commented out, so they stay ordered)
attrition <- attrition %>% 
  janitor::clean_names() %>% 
  # mutate_if(is.ordered, as.numeric) %>% 
  relocate(attrition, .before = everything())
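
The commented-out `mutate_if(is.ordered, as.numeric)` step would replace each ordered factor with the integer position of its level. A minimal base R sketch of that effect follows; the vector `satisfaction` and its level order are assumptions made for illustration and are not taken from the dataset itself.

```r
# Toy ordered factor resembling job_satisfaction
# (values and level order are assumed for illustration)
satisfaction <- factor(
  c("Very_High", "Medium", "High", "Low"),
  levels = c("Low", "Medium", "High", "Very_High"),
  ordered = TRUE
)

# Confirm the factor is ordered, as mutate_if(is.ordered, ...) would check
is.ordered(satisfaction)
# TRUE

# as.numeric() returns the position of each value in the level order
satisfaction_num <- as.numeric(satisfaction)
satisfaction_num
# 4 2 3 1
```

Leaving the step commented out keeps the level labels, which is why the `<ord>` columns still appear in the output below.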

First look at dataset

attrition %>% 
  glimpse()
## Rows: 1,470
## Columns: 31
## $ attrition                  <fct> Yes, No, Yes, No, No, No, No, No, No, No, N…
## $ age                        <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ business_travel            <fct> Travel_Rarely, Travel_Frequently, Travel_Ra…
## $ daily_rate                 <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department                 <fct> Sales, Research_Development, Research_Devel…
## $ distance_from_home         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education                  <ord> College, Below_College, College, Master, Be…
## $ education_field            <fct> Life_Sciences, Life_Sciences, Other, Life_S…
## $ environment_satisfaction   <ord> Medium, High, Very_High, Very_High, Low, Ve…
## $ gender                     <fct> Female, Male, Male, Female, Male, Male, Fem…
## $ hourly_rate                <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84,…
## $ job_involvement            <ord> High, Medium, Medium, High, High, High, Ver…
## $ job_level                  <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1…
## $ job_role                   <fct> Sales_Executive, Research_Scientist, Labora…
## $ job_satisfaction           <ord> Very_High, Medium, High, High, Medium, Very…
## $ marital_status             <fct> Single, Married, Single, Married, Married, …
## $ monthly_income             <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ monthly_rate               <int> 19479, 24907, 2396, 23159, 16632, 11864, 99…
## $ num_companies_worked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ over_time                  <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No,…
## $ percent_salary_hike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating         <ord> Excellent, Outstanding, Excellent, Excellen…
## $ relationship_satisfaction  <ord> Low, Very_High, Medium, High, Very_High, Hi…
## $ stock_option_level         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0…
## $ total_working_years        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ training_times_last_year   <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4…
## $ work_life_balance          <ord> Bad, Better, Better, Better, Better, Good, …
## $ years_at_company           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_in_current_role      <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2…
## $ years_since_last_promotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
## $ years_with_curr_manager    <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3…

Take a deeper look at each variable with skimr

skimr::skim(attrition) %>% 
  knitr::kable()
| skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| factor | attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | business_travel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | education | 0 | 1 | TRUE | 5 | Bac: 572, Mas: 398, Col: 282, Bel: 170 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | education_field | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | environment_satisfaction | 0 | 1 | TRUE | 4 | Hig: 453, Ver: 446, Med: 287, Low: 284 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | job_involvement | 0 | 1 | TRUE | 4 | Hig: 868, Med: 375, Ver: 144, Low: 83 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | job_role | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | job_satisfaction | 0 | 1 | TRUE | 4 | Ver: 459, Hig: 442, Low: 289, Med: 280 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | marital_status | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | over_time | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | performance_rating | 0 | 1 | TRUE | 2 | Exc: 1244, Out: 226, Low: 0, Goo: 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | relationship_satisfaction | 0 | 1 | TRUE | 4 | Hig: 459, Ver: 432, Med: 303, Low: 276 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | work_life_balance | 0 | 1 | TRUE | 4 | Bet: 893, Goo: 344, Bes: 153, Bad: 80 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | age | 0 | 1 | NA | NA | NA | 3.692381e+01 | 9.1353735 | 18 | 30 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
| numeric | daily_rate | 0 | 1 | NA | NA | NA | 8.024857e+02 | 403.5090999 | 102 | 465 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
| numeric | distance_from_home | 0 | 1 | NA | NA | NA | 9.192517e+00 | 8.1068644 | 1 | 2 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| numeric | hourly_rate | 0 | 1 | NA | NA | NA | 6.589116e+01 | 20.3294276 | 30 | 48 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
| numeric | job_level | 0 | 1 | NA | NA | NA | 2.063946e+00 | 1.1069399 | 1 | 1 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| numeric | monthly_income | 0 | 1 | NA | NA | NA | 6.502931e+03 | 4707.9567831 | 1009 | 2911 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
| numeric | monthly_rate | 0 | 1 | NA | NA | NA | 1.431310e+04 | 7117.7860441 | 2094 | 8047 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
| numeric | num_companies_worked | 0 | 1 | NA | NA | NA | 2.693197e+00 | 2.4980090 | 0 | 1 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| numeric | percent_salary_hike | 0 | 1 | NA | NA | NA | 1.520952e+01 | 3.6599377 | 11 | 12 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| numeric | stock_option_level | 0 | 1 | NA | NA | NA | 7.938776e-01 | 0.8520767 | 0 | 0 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| numeric | total_working_years | 0 | 1 | NA | NA | NA | 1.127959e+01 | 7.7807817 | 0 | 6 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| numeric | training_times_last_year | 0 | 1 | NA | NA | NA | 2.799320e+00 | 1.2892706 | 0 | 2 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| numeric | years_at_company | 0 | 1 | NA | NA | NA | 7.008163e+00 | 6.1265252 | 0 | 3 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
| numeric | years_in_current_role | 0 | 1 | NA | NA | NA | 4.229252e+00 | 3.6231370 | 0 | 2 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| numeric | years_since_last_promotion | 0 | 1 | NA | NA | NA | 2.187755e+00 | 3.2224303 | 0 | 0 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| numeric | years_with_curr_manager | 0 | 1 | NA | NA | NA | 4.123129e+00 | 3.5681361 | 0 | 2 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |

Count attrition (target)

attrition %>%
     count(attrition)
##   attrition    n
## 1        No 1233
## 2       Yes  237

Our target (attrition) is highly imbalanced: only 237 of the 1,470 rows (about 16%) have attrition = Yes.
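
To see why the imbalance matters, it may help to turn the counts into proportions and compare them with a trivial baseline. The sketch below uses base R and copies the two counts from the `count()` output above; the names `class_counts`, `class_props` and `baseline_accuracy` are introduced here for illustration only.

```r
# Class counts copied from the count() output above
class_counts <- c(No = 1233, Yes = 237)

# Share of each class in the full dataset
class_props <- class_counts / sum(class_counts)
round(class_props, 3)
#    No   Yes 
# 0.839 0.161 

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- max(class_props)
round(baseline_accuracy, 3)
# 0.839
```

A classifier that always answers No would already be right about 84% of the time, so accuracy alone is likely to be a misleading yardstick for this target.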