Putting the pieces together
The first step is to define your blueprint (aka recipe). Here you supply the formula of interest (the target variable, the features, and the data these are based on) with recipe(), and then you sequentially add feature engineering steps with the step_xxx() functions. In the blueprint below, we:
Remove near-zero variance features that are categorical (aka nominal).
Ordinal encode our quality-based features (which are inherently ordinal).
Center and scale (i.e., standardize) all numeric features.
One-hot encode our remaining categorical features.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

blueprint
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Sparse, unbalanced variable filter on: all_nominal()
## • Integer encoding for: matches("Qual|Cond|QC|Qu")
## • Centering and scaling for: all_numeric(), -all_outcomes()
## • Dummy variables from: all_nominal(), -all_outcomes()
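To make the two encoding steps more concrete, here is a minimal base R sketch of the underlying ideas. It does not use recipes, and the toy data frame, the column values, and the ordering in qual_levels are assumptions made purely for illustration.

```r
# Hypothetical toy data standing in for two Ames-style columns
toy <- data.frame(
  Kitchen_Qual = c("Fair", "Good", "Typical", "Excellent"),
  Central_Air  = c("Y", "N", "Y", "Y")
)

# Ordinal (integer) encoding: map ordered quality levels to 1, 2, 3, ...
# The level ordering below is an assumption for this sketch.
qual_levels <- c("Poor", "Fair", "Typical", "Good", "Excellent")
toy$Kitchen_Qual_int <- as.integer(factor(toy$Kitchen_Qual, levels = qual_levels))

# One-hot encoding: one 0/1 column per level, with no reference level dropped
central_air <- factor(toy$Central_Air)
one_hot <- model.matrix(~ central_air - 1)

toy$Kitchen_Qual_int
one_hot
```

The integer encoding keeps the ordering information in a single column, whereas the one-hot encoding produces one indicator column per level, which is why the number of columns grows so much after the dummy step.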
Next, we need to train this blueprint on some training data. Remember, many feature engineering steps (e.g., standardization and PCA) should not be estimated on the test data, as this would create data leakage. So in this step we estimate these parameters based on the training data of interest, which is what prep() does.
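The leakage concern can be illustrated with a small base R sketch of standardization. The vectors train_x and test_x are hypothetical toy values; the point is only that the mean and standard deviation are estimated once, on the training data, and then reused.

```r
# Hypothetical toy feature split into training and test portions
train_x <- c(2, 4, 6, 8)
test_x  <- c(5, 10)

# Estimate the standardization parameters on the training data only
train_mean <- mean(train_x)
train_sd   <- sd(train_x)

# Apply those same training-based parameters to both sets
train_scaled <- (train_x - train_mean) / train_sd
test_scaled  <- (test_x - train_mean) / train_sd

train_scaled
test_scaled
```

Had we instead computed the mean and standard deviation from the test values, or from the training and test values combined, information about the test set would have influenced the transformation.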
Lastly, we can apply our blueprint to new data (e.g., the training data or future test data) with bake().
blueprint %>%
  prep() %>%
  bake(new_data = ames_train) %>%
  glimpse()
## Rows: 2,049
## Columns: 220
## $ Lot_Frontage <dbl> -1.13208589, -1.…
## $ Lot_Area <dbl> -1.04082032, -1.…
## $ Condition_1 <dbl> -0.04804813, -0.…
## $ Overall_Qual <dbl> -0.77351309, -0.…
## $ Overall_Cond <dbl> -0.4967950, -0.4…
## $ Year_Built <dbl> -0.01478954, -0.…
## $ Year_Remod_Add <dbl> -0.63615965, -0.…
## $ Mas_Vnr_Area <dbl> 2.08710198, 1.48…
## $ Exter_Qual <dbl> 0.6724948, 0.672…
## $ Exter_Cond <dbl> 0.377337, 0.3773…
## $ Bsmt_Qual <dbl> 1.07938607, 1.07…
## $ BsmtFin_SF_1 <dbl> 0.8154852, 1.262…
## $ BsmtFin_SF_2 <dbl> -0.2907006, -0.2…
## $ Bsmt_Unf_SF <dbl> -0.75734818, -0.…
## $ Total_Bsmt_SF <dbl> -1.18297005, -1.…
## $ Heating_QC <dbl> 1.4517086, 1.451…
## $ First_Flr_SF <dbl> -1.60448826, -1.…
## $ Second_Flr_SF <dbl> 0.072573874, 0.0…
## $ Low_Qual_Fin_SF <dbl> -0.1010541, -0.1…
## $ Gr_Liv_Area <dbl> -0.79558772, -0.…
## $ Bsmt_Full_Bath <dbl> -0.8225024, -0.8…
## $ Bsmt_Half_Bath <dbl> -0.2463384, -0.2…
## $ Full_Bath <dbl> -1.029885, -1.02…
## $ Half_Bath <dbl> 1.2456257, 1.245…
## $ Bedroom_AbvGr <dbl> 0.1637802, 0.163…
## $ Kitchen_AbvGr <dbl> -0.2134855, -0.2…
## $ Kitchen_Qual <dbl> 0.9051278, 0.905…
## $ TotRms_AbvGrd <dbl> -0.2964932, -0.2…
## $ Fireplaces <dbl> -0.9212202, -0.9…
## $ Fireplace_Qu <dbl> -0.07498851, -0.…
## $ Garage_Cars <dbl> -1.0205957, -1.0…
## $ Garage_Area <dbl> -0.71662117, -0.…
## $ Garage_Qual <dbl> 0.3203747, 0.320…
## $ Garage_Cond <dbl> 0.2904161, 0.290…
## $ Wood_Deck_SF <dbl> -0.74509691, -0.…
## $ Open_Porch_SF <dbl> -0.69687402, -0.…
## $ Enclosed_Porch <dbl> -0.3513871, -0.3…
## $ Three_season_porch <dbl> -0.1067458, -0.1…
## $ Screen_Porch <dbl> -0.2886199, -0.2…
## $ Pool_Area <dbl> -0.06496428, -0.…
## $ Misc_Val <dbl> -0.08126785, -0.…
## $ Mo_Sold <dbl> -1.16165655, -1.…
## $ Year_Sold <dbl> 1.67927, 1.67927…
## $ Sale_Condition <dbl> 0.217264, -0.689…
## $ Longitude <dbl> 0.6035963, 0.613…
## $ Latitude <dbl> 0.95162362, 0.95…
## $ Sale_Price <int> 105500, 88000, 1…
## $ MS_SubClass_One_Story_1946_and_Newer_All_Styles <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_One_Story_1945_and_Older <dbl> 0, 0, 0, 0, 1, 0…
## $ MS_SubClass_One_Story_with_Finished_Attic_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_One_and_Half_Story_Unfinished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_One_and_Half_Story_Finished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Two_Story_1946_and_Newer <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Two_Story_1945_and_Older <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Two_and_Half_Story_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Split_or_Multilevel <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Split_Foyer <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Duplex_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 1…
## $ MS_SubClass_One_Story_PUD_1946_and_Newer <dbl> 0, 0, 1, 1, 0, 0…
## $ MS_SubClass_One_and_Half_Story_PUD_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Two_Story_PUD_1946_and_Newer <dbl> 1, 1, 0, 0, 0, 0…
## $ MS_SubClass_PUD_Multilevel_Split_Level_Foyer <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_SubClass_Two_Family_conversion_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_Zoning_Floating_Village_Residential <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_Zoning_Residential_High_Density <dbl> 0, 0, 0, 0, 1, 0…
## $ MS_Zoning_Residential_Low_Density <dbl> 0, 0, 1, 1, 0, 0…
## $ MS_Zoning_Residential_Medium_Density <dbl> 1, 1, 0, 0, 0, 1…
## $ MS_Zoning_A_agr <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_Zoning_C_all <dbl> 0, 0, 0, 0, 0, 0…
## $ MS_Zoning_I_all <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Shape_Regular <dbl> 1, 1, 1, 1, 1, 1…
## $ Lot_Shape_Slightly_Irregular <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Shape_Moderately_Irregular <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Shape_Irregular <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Config_Corner <dbl> 0, 0, 0, 0, 1, 0…
## $ Lot_Config_CulDSac <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Config_FR2 <dbl> 0, 0, 1, 0, 0, 0…
## $ Lot_Config_FR3 <dbl> 0, 0, 0, 0, 0, 0…
## $ Lot_Config_Inside <dbl> 1, 1, 0, 1, 0, 1…
## $ Neighborhood_North_Ames <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_College_Creek <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Old_Town <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Edwards <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Somerset <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Northridge_Heights <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Gilbert <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Sawyer <dbl> 0, 0, 0, 0, 0, 1…
## $ Neighborhood_Northwest_Ames <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Sawyer_West <dbl> 0, 0, 0, 1, 1, 0…
## $ Neighborhood_Mitchell <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Brookside <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Crawford <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Iowa_DOT_and_Rail_Road <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Timberland <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Northridge <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Stone_Brook <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_South_and_West_of_Iowa_State_University <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Clear_Creek <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Meadow_Village <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Briardale <dbl> 1, 1, 0, 0, 0, 0…
## $ Neighborhood_Bloomington_Heights <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Veenker <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Northpark_Villa <dbl> 0, 0, 1, 0, 0, 0…
## $ Neighborhood_Blueste <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Greens <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Green_Hills <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Landmark <dbl> 0, 0, 0, 0, 0, 0…
## $ Neighborhood_Hayden_Lake <dbl> 0, 0, 0, 0, 0, 0…
## $ Bldg_Type_OneFam <dbl> 0, 0, 0, 0, 1, 0…
## $ Bldg_Type_TwoFmCon <dbl> 0, 0, 0, 0, 0, 0…
## $ Bldg_Type_Duplex <dbl> 0, 0, 0, 0, 0, 1…
## $ Bldg_Type_Twnhs <dbl> 1, 1, 1, 0, 0, 0…
## $ Bldg_Type_TwnhsE <dbl> 0, 0, 0, 1, 0, 0…
## $ House_Style_One_and_Half_Fin <dbl> 0, 0, 0, 0, 0, 1…
## $ House_Style_One_and_Half_Unf <dbl> 0, 0, 0, 0, 0, 0…
## $ House_Style_One_Story <dbl> 0, 0, 1, 1, 1, 0…
## $ House_Style_SFoyer <dbl> 0, 0, 0, 0, 0, 0…
## $ House_Style_SLvl <dbl> 0, 0, 0, 0, 0, 0…
## $ House_Style_Two_and_Half_Fin <dbl> 0, 0, 0, 0, 0, 0…
## $ House_Style_Two_and_Half_Unf <dbl> 0, 0, 0, 0, 0, 0…
## $ House_Style_Two_Story <dbl> 1, 1, 0, 0, 0, 0…
## $ Roof_Style_Flat <dbl> 0, 0, 0, 0, 0, 0…
## $ Roof_Style_Gable <dbl> 1, 1, 1, 1, 1, 1…
## $ Roof_Style_Gambrel <dbl> 0, 0, 0, 0, 0, 0…
## $ Roof_Style_Hip <dbl> 0, 0, 0, 0, 0, 0…
## $ Roof_Style_Mansard <dbl> 0, 0, 0, 0, 0, 0…
## $ Roof_Style_Shed <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_AsbShng <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_AsphShn <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_BrkComm <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_BrkFace <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_CBlock <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_CemntBd <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_HdBoard <dbl> 1, 1, 0, 0, 0, 0…
## $ Exterior_1st_ImStucc <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_MetalSd <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_Plywood <dbl> 0, 0, 1, 1, 0, 0…
## $ Exterior_1st_PreCast <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_Stone <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_Stucco <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_1st_VinylSd <dbl> 0, 0, 0, 0, 0, 1…
## $ Exterior_1st_Wd.Sdng <dbl> 0, 0, 0, 0, 1, 0…
## $ Exterior_1st_WdShing <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_AsbShng <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_AsphShn <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_Brk.Cmn <dbl> 0, 0, 1, 0, 0, 0…
## $ Exterior_2nd_BrkFace <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_CBlock <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_CmentBd <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_HdBoard <dbl> 1, 0, 0, 0, 0, 0…
## $ Exterior_2nd_ImStucc <dbl> 0, 1, 0, 0, 0, 0…
## $ Exterior_2nd_MetalSd <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_Other <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_Plywood <dbl> 0, 0, 0, 1, 0, 0…
## $ Exterior_2nd_PreCast <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_Stone <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_Stucco <dbl> 0, 0, 0, 0, 0, 0…
## $ Exterior_2nd_VinylSd <dbl> 0, 0, 0, 0, 0, 1…
## $ Exterior_2nd_Wd.Sdng <dbl> 0, 0, 0, 0, 1, 0…
## $ Exterior_2nd_Wd.Shng <dbl> 0, 0, 0, 0, 0, 0…
## $ Mas_Vnr_Type_BrkCmn <dbl> 0, 0, 0, 0, 0, 0…
## $ Mas_Vnr_Type_BrkFace <dbl> 1, 1, 0, 0, 0, 0…
## $ Mas_Vnr_Type_CBlock <dbl> 0, 0, 0, 0, 0, 0…
## $ Mas_Vnr_Type_None <dbl> 0, 0, 1, 1, 1, 1…
## $ Mas_Vnr_Type_Stone <dbl> 0, 0, 0, 0, 0, 0…
## $ Foundation_BrkTil <dbl> 0, 0, 0, 0, 1, 0…
## $ Foundation_CBlock <dbl> 1, 1, 1, 1, 0, 0…
## $ Foundation_PConc <dbl> 0, 0, 0, 0, 0, 0…
## $ Foundation_Slab <dbl> 0, 0, 0, 0, 0, 1…
## $ Foundation_Stone <dbl> 0, 0, 0, 0, 0, 0…
## $ Foundation_Wood <dbl> 0, 0, 0, 0, 0, 0…
## $ Bsmt_Exposure_Av <dbl> 0, 0, 0, 0, 0, 0…
## $ Bsmt_Exposure_Gd <dbl> 0, 0, 0, 0, 0, 0…
## $ Bsmt_Exposure_Mn <dbl> 0, 0, 0, 0, 0, 0…
## $ Bsmt_Exposure_No <dbl> 1, 1, 1, 1, 1, 0…
## $ Bsmt_Exposure_No_Basement <dbl> 0, 0, 0, 0, 0, 1…
## $ BsmtFin_Type_1_ALQ <dbl> 0, 0, 0, 1, 0, 0…
## $ BsmtFin_Type_1_BLQ <dbl> 0, 0, 0, 0, 0, 0…
## $ BsmtFin_Type_1_GLQ <dbl> 0, 0, 0, 0, 0, 0…
## $ BsmtFin_Type_1_LwQ <dbl> 0, 0, 0, 0, 0, 0…
## $ BsmtFin_Type_1_No_Basement <dbl> 0, 0, 0, 0, 0, 1…
## $ BsmtFin_Type_1_Rec <dbl> 1, 0, 0, 0, 0, 0…
## $ BsmtFin_Type_1_Unf <dbl> 0, 1, 1, 0, 1, 0…
## $ Central_Air_N <dbl> 0, 0, 0, 0, 1, 0…
## $ Central_Air_Y <dbl> 1, 1, 1, 1, 0, 1…
## $ Electrical_FuseA <dbl> 0, 0, 0, 0, 1, 0…
## $ Electrical_FuseF <dbl> 0, 0, 0, 0, 0, 0…
## $ Electrical_FuseP <dbl> 0, 0, 0, 0, 0, 0…
## $ Electrical_Mix <dbl> 0, 0, 0, 0, 0, 0…
## $ Electrical_SBrkr <dbl> 1, 1, 1, 1, 0, 1…
## $ Electrical_Unknown <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Type_Attchd <dbl> 0, 0, 1, 1, 0, 1…
## $ Garage_Type_Basment <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Type_BuiltIn <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Type_CarPort <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Type_Detchd <dbl> 1, 1, 0, 0, 1, 0…
## $ Garage_Type_More_Than_Two_Types <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Type_No_Garage <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Finish_Fin <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Finish_No_Garage <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Finish_RFn <dbl> 0, 0, 0, 0, 0, 0…
## $ Garage_Finish_Unf <dbl> 1, 1, 1, 1, 1, 1…
## $ Paved_Drive_Dirt_Gravel <dbl> 0, 0, 0, 0, 0, 0…
## $ Paved_Drive_Partial_Pavement <dbl> 0, 0, 0, 0, 0, 0…
## $ Paved_Drive_Paved <dbl> 1, 1, 1, 1, 1, 1…
## $ Fence_Good_Privacy <dbl> 0, 0, 0, 0, 0, 0…
## $ Fence_Good_Wood <dbl> 0, 0, 0, 0, 0, 0…
## $ Fence_Minimum_Privacy <dbl> 0, 0, 0, 1, 0, 0…
## $ Fence_Minimum_Wood_Wire <dbl> 0, 0, 0, 0, 0, 0…
## $ Fence_No_Fence <dbl> 1, 1, 1, 0, 1, 1…
## $ Sale_Type_COD <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_Con <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_ConLD <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_ConLI <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_ConLw <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_CWD <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_New <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_Oth <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_VWD <dbl> 0, 0, 0, 0, 0, 0…
## $ Sale_Type_WD. <dbl> 1, 1, 1, 1, 1, 1…
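The same prepared blueprint can also be baked on held-out data. The sketch below shows this on a small toy example rather than the Ames data; toy_train, toy_test, and the single normalization step are assumptions chosen to keep the example short, and it requires the recipes package.

```r
library(recipes)

# Hypothetical toy data, not the Ames data
toy_train <- data.frame(y = c(10, 12, 15, 19), x = c(1, 2, 3, 4))
toy_test  <- data.frame(y = c(11, 20), x = c(2, 6))

# Define the blueprint, then estimate its parameters on the training data
toy_rec <- recipe(y ~ x, data = toy_train)
toy_rec <- step_normalize(toy_rec, all_numeric(), -all_outcomes())
toy_prepped <- prep(toy_rec, training = toy_train)

# Apply the trained blueprint to both the training and the test data
baked_train <- bake(toy_prepped, new_data = toy_train)
baked_test  <- bake(toy_prepped, new_data = toy_test)

baked_test
```

Because the centering and scaling values come from toy_train, the baked test feature is not forced to have mean 0 and standard deviation 1.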
Next, we apply the same resampling method and hyperparameter search grid as we did in Section 2.7. The only difference is that when we train our resample models with train(), we supply our blueprint as the first argument and caret takes care of the rest.
# Specify resampling plan
cv <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5
)

# Construct grid of hyperparameter values
hyper_grid <- expand.grid(k = seq(2, 25, by = 1))

# Tune a knn model using grid search
knn_fit2 <- train(
  blueprint,
  data = ames_train,
  method = "knn",
  trControl = cv,
  tuneGrid = hyper_grid,
  metric = "RMSE"
)

# print model results
knn_fit2
## k-Nearest Neighbors
##
## 2049 samples
## 80 predictor
##
## Recipe steps: nzv, integer, normalize, dummy
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1844, 1845, 1843, 1844, 1844, 1845, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 2 35718.34 0.8083251 22732.57
## 3 34931.41 0.8187706 21941.28
## 4 34794.68 0.8230095 21713.76
## 5 34607.51 0.8275359 21484.67
## 6 34395.72 0.8319064 21252.76
## 7 34139.63 0.8356526 21067.31
## 8 33929.29 0.8394225 20957.91
## 9 33733.30 0.8426560 20865.28
## 10 33719.86 0.8432838 20868.48
## 11 33833.91 0.8431627 20917.45
## 12 33894.13 0.8437289 21003.84
## 13 33998.35 0.8432144 21105.27
## 14 34142.54 0.8427766 21195.34
## 15 34208.80 0.8428515 21256.55
## 16 34306.03 0.8422919 21308.34
## 17 34428.02 0.8417635 21386.85
## 18 34572.65 0.8406348 21454.44
## 19 34611.91 0.8407819 21510.62
## 20 34700.15 0.8404275 21572.50
## 21 34833.19 0.8395586 21635.17
## 22 34822.01 0.8402366 21656.58
## 23 34904.71 0.8399106 21702.15
## 24 35003.25 0.8394481 21769.57
## 25 35061.56 0.8392900 21788.96
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 10.
# plot cross validation results
ggplot(knn_fit2)
Looking at our results, we see that the best model was associated with k = 10, which resulted in a cross-validated RMSE of 33,720.
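The selection rule reported in the output ("smallest value" of RMSE) can be reproduced by hand. The sketch below copies a few rows from the printed resampling results into a hypothetical data frame called results; with the fitted object available, the same information should be accessible through knn_fit2$results and knn_fit2$bestTune.

```r
# A few rows copied from the printed resampling results
results <- data.frame(
  k    = c(8, 9, 10, 11, 12),
  RMSE = c(33929.29, 33733.30, 33719.86, 33833.91, 33894.13)
)

# Pick the row with the smallest cross-validated RMSE
best <- results[which.min(results$RMSE), ]

best$k
round(best$RMSE)
```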