4.3 Factor Variables

We can use the building type as a factor variable to help with our predictions. The building type variable has five levels:

dat %>% count(Bldg_Type)
## # A tibble: 5 × 2
##   Bldg_Type     n
##   <fct>     <int>
## 1 OneFam     2425
## 2 TwoFmCon     62
## 3 Duplex      109
## 4 Twnhs       101
## 5 TwnhsE      233

4.3.1 Dummy Variables

  • One Hot Encoding (KNN, Tree Models) vs P-1 representation (Regression)

One-hot encoding includes an indicator column for every factor level in the model. Including all P distinct levels together with the intercept term creates perfect collinearity, because the P indicator columns always sum to the intercept column of ones.

model.matrix(~Bldg_Type -1, data = dat) %>% head(1)
##   Bldg_TypeOneFam Bldg_TypeTwoFmCon Bldg_TypeDuplex Bldg_TypeTwnhs
## 1               1                 0               0              0
##   Bldg_TypeTwnhsE
## 1               0
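
The collinearity problem can be checked directly. The following is a minimal sketch on a made-up three-level data frame (`toy` and its values are hypothetical, not part of `dat`): binding an intercept column onto the one-hot columns yields a matrix with more columns than its rank.

```r
# Hypothetical toy data standing in for `dat`; only the factor matters here.
toy <- data.frame(
  Bldg_Type = factor(
    c("OneFam", "Duplex", "Twnhs", "OneFam", "Duplex", "Twnhs"),
    levels = c("OneFam", "Duplex", "Twnhs")
  )
)

# One-hot encoding: one indicator column per level, no intercept.
one_hot <- model.matrix(~ Bldg_Type - 1, data = toy)

# Every row has exactly one indicator switched on.
all(rowSums(one_hot) == 1)

# Add an intercept column of ones next to all P indicator columns.
with_intercept <- cbind(Intercept = 1, one_hot)

ncol(with_intercept)     # number of columns (P + 1)
qr(with_intercept)$rank  # numerical rank, one less than the column count
```

Because the rank is lower than the number of columns, least squares cannot identify a unique coefficient for every column, which is why regression drops one level.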

  • P-1 encoding (using all of the factor levels except the reference)

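R uses the first factor level (here OneFam) as the reference level, so it gets no coefficient of its own, and each remaining coefficient should be interpreted relative to it. For example, holding the other predictors fixed, a Duplex is estimated to sell for roughly 47,700 less than a OneFam home.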

lm(Sale_Price ~ total_sf + Lot_Area + bath + Bedroom_AbvGr + Central_Air + 
     Bldg_Type, data = dat) %>% 
  summary() %>% broom::tidy()
## # A tibble: 10 × 5
##    term                estimate std.error statistic   p.value
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)         1361.     4666.        0.292 7.71e-  1
##  2 total_sf             100.        2.51     39.9   1.19e-278
##  3 Lot_Area               0.430     0.118     3.65  2.68e-  4
##  4 bath               28825.     1393.       20.7   7.17e- 89
##  5 Bedroom_AbvGr     -22921.     1325.      -17.3   6.49e- 64
##  6 Central_AirY       32779.     3661.        8.95  6.04e- 19
##  7 Bldg_TypeTwoFmCon -38089.     6151.       -6.19  6.77e- 10
##  8 Bldg_TypeDuplex   -47689.     4755.      -10.0   2.71e- 23
##  9 Bldg_TypeTwnhs    -35466.     4856.       -7.30  3.60e- 13
## 10 Bldg_TypeTwnhsE    -4281.     3489.       -1.23  2.20e-  1
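
If a different baseline makes the comparison easier to read, the reference level can be changed before fitting. The following is a minimal sketch using base R's `relevel()` on a made-up data frame (`toy`, `Bldg_Type2`, and the prices are hypothetical):

```r
# Hypothetical toy data; prices are invented for illustration only.
toy <- data.frame(
  Sale_Price = c(200, 150, 120, 210, 155, 125),
  Bldg_Type = factor(
    c("OneFam", "Duplex", "Twnhs", "OneFam", "Duplex", "Twnhs"),
    levels = c("OneFam", "Duplex", "Twnhs")
  )
)

# Default P-1 encoding: the first level (OneFam) is absorbed into the intercept.
colnames(model.matrix(~ Bldg_Type, data = toy))

# Make Duplex the reference level instead.
toy$Bldg_Type2 <- relevel(toy$Bldg_Type, ref = "Duplex")

# Coefficients are now differences relative to Duplex.
fit <- lm(Sale_Price ~ Bldg_Type2, data = toy)
names(coef(fit))
```

The fitted values are identical under either reference level; only the meaning of the intercept and the individual coefficients changes.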

4.3.2 Ordered Factor Variables

Treating an ordered factor (e.g., a Likert scale, a loan grade, or a crime-rate category) as a numeric variable preserves the information contained in the ordering, which would be lost if we simply dummy-encoded it as an ordinary factor.
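
As a minimal sketch of this conversion, assume a hypothetical loan-grade vector (`grades` and its level order are made up for illustration). The levels are declared from worst to best so that the integer codes follow the ordering:

```r
# Hypothetical loan grades; D is assumed to be the worst and A the best.
grades <- c("C", "A", "B", "A", "D")

# Declare the ordering explicitly; otherwise levels are sorted alphabetically.
grade_f <- factor(grades, levels = c("D", "C", "B", "A"), ordered = TRUE)

# Numeric conversion: each value becomes the position of its level (1 = D, 4 = A).
grade_num <- as.integer(grade_f)
grade_num

# For comparison, passing the ordered factor straight into a model formula
# uses R's default polynomial contrasts rather than a single numeric column.
colnames(model.matrix(~ grade_f))
```

The numeric version uses a single coefficient and assumes the steps between adjacent levels are equally spaced, which is a modeling assumption worth checking for the variable at hand.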