10.2 Resubstitution approach
In chapter 8, we fit a linear model to the training set. This is candidate model #1.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

ames_rec <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal()) %>%
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)
We can also fit a random forest model to the training data; this will be candidate model #2.
rf_model <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wflow <-
  workflow() %>%
  add_formula(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
      Latitude + Longitude) %>%
  add_model(rf_model)

rf_fit <- rf_wflow %>% fit(data = ames_train)
We can compare the performance of these two candidate models with a small helper function that predicts on a data set and computes RMSE and R-squared:
estimate_perf <- function(model, dat) {
  # Capture the names of the objects used
  cl <- match.call()
  obj_name <- as.character(cl$model)
  data_name <- as.character(cl$dat)
  data_name <- gsub("ames_", "", data_name)

  # Estimate these metrics:
  reg_metrics <- metric_set(rmse, rsq)

  model %>%
    predict(dat) %>%
    bind_cols(dat %>% select(Sale_Price)) %>%
    reg_metrics(Sale_Price, .pred) %>%
    select(-.estimator) %>%
    mutate(object = obj_name, data = data_name)
}
Resubstitution errors for the linear model:
estimate_perf(lm_fit, ames_train)
## # A tibble: 2 × 4
## .metric .estimate object data
## <chr> <dbl> <chr> <chr>
## 1 rmse 0.0751 lm_fit train
## 2 rsq 0.819 lm_fit train
Resubstitution errors for the random forest model:
estimate_perf(rf_fit, ames_train)
## # A tibble: 2 × 4
## .metric .estimate object data
## <chr> <dbl> <chr> <chr>
## 1 rmse 0.0367 rf_fit train
## 2 rsq 0.960 rf_fit train
The random forest model appears to perform much better; its resubstitution RMSE is roughly half that of the linear model. Based on these results, we might choose the random forest as our sole candidate model and evaluate it on the test set:
estimate_perf(rf_fit, ames_test)
## # A tibble: 2 × 4
## .metric .estimate object data
## <chr> <dbl> <chr> <chr>
## 1 rmse 0.0694 rf_fit test
## 2 rsq 0.853 rf_fit test
The random forest model, which looked excellent on the training set, performs markedly worse on the test set. For comparison, we can also see how the linear model does on the test set:
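estimate_perf(lm_fit, ames_test)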
Interestingly, the linear model performs about the same (and not especially well) on both the training and test sets, while the random forest does not. This is because linear models are considered high-bias models and random forests are low-bias:
In this context, bias is the difference between the true data pattern and the types
of patterns that the model can emulate. Many black-box machine learning models have low bias. Other models (such as linear/logistic regression, discriminant analysis, and others) are not as adaptable and are considered high-bias models.
Re-predicting the training set (resubstitution) gives an overly optimistic picture of performance, especially for low-bias models like the random forest. To get honest estimates of how a model will perform on new data, we need to use resampling methods.
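As a preview of those methods, here is a minimal sketch of 10-fold cross-validation using the rsample and tune functions attached by tidymodels; the fold count and seed are illustrative choices, not values from this section.
set.seed(1001)
# Create 10 cross-validation folds from the training set
ames_folds <- vfold_cv(ames_train, v = 10)

# Fit the random forest workflow on each analysis set and
# measure performance on the corresponding assessment set
rf_res <- fit_resamples(rf_wflow, resamples = ames_folds)

# Summarize the resampled RMSE and R-squared estimates
collect_metrics(rf_res)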