8.4 HOW DATA ARE USED BY THE RECIPE

8.4.1 EXAMPLES OF RECIPE STEPS:

  1. ENCODING QUALITATIVE DATA IN A NUMERIC FORMAT
  • step_unknown() change missing values to a dedicated factor level

  • step_novel() a new factor level may be encountered in future data

  • step_other() to analyze the frequencies of the factor levels

the bottom 1% of the neighborhoods will be lumped into a new level called “other”

    step_other(Neighborhood, threshold = 0.01)
  • step_dummy() for converting a factor predictor to a numeric format (one_hot argument to include the reference variable)
simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors())
  1. INTERACTION TERMS
  • step_interact(~ interaction terms)

one predictor has an effect on the outcome that is contingent on one or more other predictors

Numerically, an interaction term between predictors is encoded as their product.

ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) + 
  geom_point(alpha = .2) + 
  facet_wrap(~ Bldg_Type) + 
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "lightblue") + 
  scale_x_log10() + 
  scale_y_log10() + 
  labs(x = "Gross Living Area", y = "Sale Price (USD)")

Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type + log10(Gr_Liv_Area):Bldg_Type

or

Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type 
simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  # Gr_Liv_Area is on the log scale from a previous step
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") )

Additional interactions can be specified in this formula by separating them by +

  1. SPLINE FUNCTIONS
  • step_ns() for non-linear relationships

Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, non-linear relationship

library(patchwork)
library(splines)

plot_smoother <- function(deg_free) {
  ggplot(ames_train, aes(x = Latitude, y = 10^Sale_Price)) + 
    geom_point(alpha = .2) + 
    scale_y_log10() +
    geom_smooth(
      method = lm,
      formula = y ~ ns(x, df = deg_free),
      color = "lightblue",
      se = FALSE
    ) +
    labs(title = paste(deg_free, "Spline Terms"),
         y = "Sale Price (USD)")
}

( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )

recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, deg_free = 20)
  1. FEATURE EXTRACTION
  • step_normalize() center and scale each column

  • step_pca() principal component analysis, it extracts as much of the original information in the predictor set as possible using a smaller number of features,reducing the correlation between predictors.

    step_pca(matches(“(SF$)|(Gr_Liv)”))

  • step_ica() independent component analysis

  • step_umap() uniform manifold approximation and projection

  1. ROW SAMPLING STEPS
  • Downsampling ({themis} package)

    step_downsample(outcome_column_name)

  • Upsampling

  • Hybrid methods

step_filter(), step_sample(), step_slice(), and step_arrange() …

  1. GENERAL TRANSFORMATIONS
  • step_mutate()
  1. NATURAL LANGUAGE PROCESSING

can apply natural language processing methods to the data