Capítulo 8 Feature engineering with recipes

Learning objectives:

  • Define feature engineering.
  • List reasons that feature engineering might be beneficial.
  • Use the {recipes} package to create a simple feature engineering recipe.
  • Use selectors from the {recipes} package to apply transformations to specific types of columns.
  • List some advantages of using a recipe for feature engineering.
  • Describe what happens when a recipe is prepared with recipes::prep().
  • Use recipes::bake() to process a dataset.
  • Recognize how to use recipes::step_unknown(), recipes::step_novel(), recipes::step_other() to prepare factor variables.
  • Explain how recipes::step_dummy() encodes qualitative data in a numeric format.
  • Recognize techniques for dealing with large numbers of categories, such as feature hashing or encoding using the {embed} package (as described in this talk by Alan Feder at rstudio::global(2021)).
  • Recognize methods for encoding ordered factors.
  • Use recipes::step_interact() to add interaction terms to a recipe.
  • Understand why some steps might only be applicable to training data.
  • Recognize the functions from {recipes} and {themis} that are only applied to training data by default.
  • Recognize that {recipes} includes functions for creating spline terms, such as step_ns().
  • Recognize that {recipes} includes functions for feature extraction, such as step_pca().
  • Use themis::step_downsample() to downsample data.
  • Recognize other row-sampling steps from the {recipes} package.
  • Use recipes::step_mutate() and recipes::step_mutate_at() for general {dplyr}-like transformations.
  • Recall that the {textrecipes} package exists for text-specific feature-engineering steps.
  • Understand that the functions of the {recipes} package use training data for all preprocessing and feature engineering steps to prevent leakage.
  • Use {recipes} to prepare data for traditional modeling functions.
  • Use tidy() to examine a recipe and its steps.
  • Refer to columns with roles other than "predictor" or "outcome".