10.2 Resubstitution approach

In chapter 8, we fit a linear model to the training set. This is candidate model #1.

library(tidymodels)
data(ames)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  # Log (base 10) transform the gross living area
  step_log(Gr_Liv_Area, base = 10) %>% 
  # Pool neighborhoods making up less than 1% of the data into an "other" level
  step_other(Neighborhood, threshold = 0.01) %>% 
  # Convert categorical predictors to indicator columns
  step_dummy(all_nominal()) %>% 
  # Create interactions between living area and the building type indicators
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  # Model spatial effects with natural splines on latitude and longitude
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

We can also fit a random forest model to the training data; this model will be candidate #2.

rf_model <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

rf_wflow <- 
  workflow() %>% 
  add_formula(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
      Latitude + Longitude) %>% 
  add_model(rf_model) 

rf_fit <- rf_wflow %>% fit(data = ames_train)

We can compare the performance of these two candidate models with a small helper function that predicts a given data set and computes RMSE and R²:

estimate_perf <- function(model, dat) {
  # Capture the names of the objects used
  cl <- match.call()
  obj_name <- as.character(cl$model)
  data_name <- as.character(cl$dat)
  data_name <- gsub("ames_", "", data_name)
  
  # Estimate these metrics:
  reg_metrics <- metric_set(rmse, rsq)
  
  model %>% 
    predict(dat) %>% 
    bind_cols(dat %>% select(Sale_Price)) %>% 
    reg_metrics(Sale_Price, .pred) %>% 
    select(-.estimator) %>% 
    mutate(object = obj_name, data = data_name)
}

Resubstitution errors for the linear model:

estimate_perf(lm_fit, ames_train)
## # A tibble: 2 × 4
##   .metric .estimate object data 
##   <chr>       <dbl> <chr>  <chr>
## 1 rmse       0.0751 lm_fit train
## 2 rsq        0.819  lm_fit train

Resubstitution errors for the random forest model:

estimate_perf(rf_fit, ames_train)
## # A tibble: 2 × 4
##   .metric .estimate object data 
##   <chr>       <dbl> <chr>  <chr>
## 1 rmse       0.0367 rf_fit train
## 2 rsq        0.960  rf_fit train

Based on these resubstitution statistics, the random forest model appears to perform far better: its training-set RMSE is roughly half that of the linear model. We can therefore choose the random forest as our only candidate model and test its performance on the test set:

estimate_perf(rf_fit, ames_test)
## # A tibble: 2 × 4
##   .metric .estimate object data 
##   <chr>       <dbl> <chr>  <chr>
## 1 rmse       0.0694 rf_fit test 
## 2 rsq        0.853  rf_fit test

The random forest model, which looked excellent when re-predicting the training set, performs substantially worse on the test set. Out of curiosity, we can also check how the linear model performs on the test set:
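
estimate_perf(lm_fit, ames_test)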

Interestingly, the linear model performs about the same, and only modestly well, on both the training and test sets, whereas the random forest does not. This is because linear models are considered high-bias models and random forests are low-bias models:

In this context, bias is the difference between the true data pattern and the types of patterns that the model can emulate. Many black-box machine learning models have low bias. Other models (such as linear/logistic regression, discriminant analysis, and others) are not as adaptable and are considered high-bias models.
— Max Kuhn and Julia Silge

Re-predicting the training set, as we did here, produces overly optimistic performance estimates, especially for low-bias models such as the random forest. To compare and evaluate models honestly before touching the test set, we need to use resampling methods.
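
A minimal sketch of one such method, 10-fold cross-validation with the rsample and tune functions attached by tidymodels, might look like the following; the seed and number of folds are arbitrary choices made here for illustration:

set.seed(1001)

# Create 10 cross-validation folds from the training set
ames_folds <- vfold_cv(ames_train, v = 10)

# Fit the random forest workflow on each analysis set and
# measure performance on the corresponding assessment set
rf_res <- fit_resamples(rf_wflow, resamples = ames_folds)

# Average the out-of-sample metrics across the ten folds
collect_metrics(rf_res)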