4.3 Factor Variables
Building type can enter the model as a factor variable to help with our predictions. The Bldg_Type variable has five levels:
## # A tibble: 5 × 2
## Bldg_Type n
## <fct> <int>
## 1 OneFam 2425
## 2 TwoFmCon 62
## 3 Duplex 109
## 4 Twnhs 101
## 5 TwnhsE 233
4.3.1 Dummy Variables
- One Hot Encoding (KNN, Tree Models) vs P-1 representation (Regression)
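One-hot encoding includes a dummy column for every factor level. In a regression, including all P levels together with the intercept term creates perfect collinearity, because the P dummy columns sum to the intercept column. Regression models therefore use P-1 dummies, while KNN and tree models can use the full one-hot encoding.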
## Bldg_TypeOneFam Bldg_TypeTwoFmCon Bldg_TypeDuplex Bldg_TypeTwnhs
## 1 1 0 0 0
## Bldg_TypeTwnhsE
## 1 0
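To make the collinearity concrete, here is a minimal sketch that compares the two encodings on a small made-up factor. The `bldg` vector is hypothetical toy data, not the housing data used in this chapter; the point is only that one-hot columns plus an intercept lose one rank, while the P-1 coding does not.

```r
# Toy factor with P = 3 levels (hypothetical data for illustration only)
bldg <- factor(c("OneFam", "Duplex", "Twnhs", "OneFam", "Twnhs", "Duplex"),
               levels = c("OneFam", "Duplex", "Twnhs"))

# One-hot encoding: removing the intercept (~ 0 + bldg) keeps all P levels
one_hot <- model.matrix(~ 0 + bldg)

# One-hot columns plus an intercept give P + 1 columns, but the dummy
# columns sum to the intercept column, so the rank is only P
with_intercept <- cbind(Intercept = 1, one_hot)
rank_one_hot <- qr(with_intercept)$rank

# P-1 (treatment) encoding: intercept plus P - 1 dummies, full column rank
p_minus_1 <- model.matrix(~ bldg)
rank_p_minus_1 <- qr(p_minus_1)$rank

print(c(columns = ncol(with_intercept), rank = rank_one_hot))
print(c(columns = ncol(p_minus_1), rank = rank_p_minus_1))
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined, which is why the regression formula interface drops one level by default.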
- P-1 encoding (Using all of the factor levels except the reference)
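R uses the first level of the factor as the reference level (here OneFam), and the coefficient of each remaining level is interpreted relative to it. In the output below, for example, the Bldg_TypeDuplex estimate of -47689 means a duplex is predicted to sell for about 47,689 less (in the units of Sale_Price) than a one-family home with the same values of the other predictors.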
lm(Sale_Price ~ total_sf + Lot_Area + bath + Bedroom_AbvGr + Central_Air +
     Bldg_Type, data = dat) %>%
  summary() %>%
  broom::tidy()
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1361. 4666. 0.292 7.71e- 1
## 2 total_sf 100. 2.51 39.9 1.19e-278
## 3 Lot_Area 0.430 0.118 3.65 2.68e- 4
## 4 bath 28825. 1393. 20.7 7.17e- 89
## 5 Bedroom_AbvGr -22921. 1325. -17.3 6.49e- 64
## 6 Central_AirY 32779. 3661. 8.95 6.04e- 19
## 7 Bldg_TypeTwoFmCon -38089. 6151. -6.19 6.77e- 10
## 8 Bldg_TypeDuplex -47689. 4755. -10.0 2.71e- 23
## 9 Bldg_TypeTwnhs -35466. 4856. -7.30 3.60e- 13
## 10 Bldg_TypeTwnhsE -4281. 3489. -1.23 2.20e- 1
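The role of the reference level can be checked on a small example. The sketch below uses a hypothetical `toy` data frame (not the chapter's data) with one factor predictor, so the coefficients can be verified against the group means by hand. It also shows `relevel()`, one base R way to choose a different reference level.

```r
# Hypothetical toy data: prices by building type (not the chapter's data)
toy <- data.frame(
  price = c(200, 210, 150, 160, 170, 180),
  type  = factor(c("OneFam", "OneFam", "Duplex", "Duplex", "Twnhs", "Twnhs"),
                 levels = c("OneFam", "Duplex", "Twnhs"))
)

# Default: the first level (OneFam) is the reference.
# The intercept is the OneFam mean; each dummy coefficient is the
# difference between that level's mean and the OneFam mean.
fit <- lm(price ~ type, data = toy)
print(coef(fit))

# relevel() moves a chosen level to the front, making it the reference
toy$type2 <- relevel(toy$type, ref = "Duplex")
fit2 <- lm(price ~ type2, data = toy)
print(coef(fit2))
```

With group means of 205 (OneFam), 155 (Duplex), and 175 (Twnhs), the first fit reports differences from OneFam and the second reports differences from Duplex. The fitted values are identical in both cases; only the interpretation of the coefficients changes.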
4.3.2 Ordered Factor Variables
Treating an ordered factor as a numeric variable preserves the information contained in the ordering, which would be lost if we simply converted it to dummy variables (examples include Likert scales, loan grades, and crime rate). The trade-off is that integer coding assumes the levels are equally spaced.
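A minimal sketch of the conversion, assuming a hypothetical vector of loan grades ordered from A (best) to D (worst). The variable names are illustrative only.

```r
# Hypothetical loan grades; the level order is stated explicitly
grade <- c("B", "A", "D", "C", "A")
grade_f <- factor(grade, levels = c("A", "B", "C", "D"), ordered = TRUE)

# Integer codes follow the level order: A = 1, B = 2, C = 3, D = 4
grade_num <- as.integer(grade_f)
print(grade_num)

# For comparison, dummy coding the same variable as an unordered factor
# needs P - 1 = 3 columns and carries no information about the ordering
grade_unordered <- factor(grade, levels = c("A", "B", "C", "D"))
grade_dummies <- model.matrix(~ grade_unordered)
print(colnames(grade_dummies))
```

The numeric version uses a single coefficient for the whole variable. Note that passing an ordered factor directly to lm() does not give this behaviour: by default R applies polynomial contrasts (contr.poly) to ordered factors, so the explicit as.integer() step is needed if a single linear effect is wanted.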