13.4 Tools for efficient grid search

A few tricks:

13.4.1 Submodel optimization

For some types of models, a single model fit can produce predictions for multiple values of a tuning parameter without refitting:

  • Partial Least Squares (no. of components to retain)

  • Boosting models (no. of boosting iterations, i.e. trees)

  • glmnet models (predictions across the amount of regularization, i.e. the penalty)

  • MARS adds a set of nonlinear features (number of terms to retain)

The tune package automatically applies this type of optimization whenever an applicable model is tuned. See also this vignette.

methods("multi_predict")
##  [1] multi_predict._C5.0*        multi_predict._earth*      
##  [3] multi_predict._elnet*       multi_predict._glmnetfit*  
##  [5] multi_predict._lognet*      multi_predict._multnet*    
##  [7] multi_predict._torch_mlp*   multi_predict._train.kknn* 
##  [9] multi_predict._xgb.Booster* multi_predict.default*     
## see '?methods' for accessing help and source code
parsnip:::multi_predict._C5.0 %>% 
  formals() %>% 
  names()
## [1] "object"   "new_data" "type"     "trees"    "..."

For example, if a C5.0 model is fit to the cell classification data, we can tune the number of trees. With all other parameters set at their default values, we can rapidly evaluate ensembles of 1 to 100 trees:

library(tidymodels)  # also attaches modeldata, which provides the cells data

data(cells)

cells <- cells %>% select(-case)

cell_folds <- vfold_cv(cells)

roc_res <- metric_set(roc_auc)

c5_spec <- 
  boost_tree(trees = tune()) %>% 
  set_engine("C5.0") %>% 
  set_mode("classification")

set.seed(2)
c5_tune <- c5_spec %>%
  tune_grid(
    class ~ .,
    resamples = cell_folds,
    grid = data.frame(trees = 1:100),
    metrics = roc_res
  )

Even though we did not explicitly invoke the submodel prediction trick in our code, this optimization is applied automatically by parsnip.

autoplot(c5_tune)

ggsave("images/13_c5_submodel.png")

13.4.2 Parallel processing

Backend packages right now are doFuture, doMC, doMPI, doParallel, doRedis, doRNG, doSNOW, and doAzureParallel.

In tune_*(), there are two approaches, often set via the parallel_over argument of control_grid() or control_resamples():

  • parallel_over = "resamples" parallelizes over the resamples only, or

  • parallel_over = "everything" parallelizes over both the resamples and the unique tuning parameter combinations, or

  • parallel_over = NULL (the default) chooses "resamples" if there is more than one resample, otherwise chooses "everything" to attempt to maximize core utilization

Note that switching between parallel_over strategies is not guaranteed to use the same random number generation schemes. However, re-tuning a model using the same parallel_over strategy is guaranteed to be reproducible between runs.

To use them, register the parallel backend first.

On a shared server, never, ever consume all of the cores.

# Determine the number of physical cores on the machine
all_cores <- parallel::detectCores(logical = FALSE)

# Create and register a PSOCK cluster as the parallel backend
library(doParallel)
cl <- makePSOCKcluster(all_cores)
doParallel::registerDoParallel(cl)
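
With a backend registered, the parallel_over strategy is passed through the control object. A minimal sketch, reusing the C5.0 objects from the submodel example above:

set.seed(2)
c5_spec %>%
  tune_grid(
    class ~ .,
    resamples = cell_folds,
    grid = data.frame(trees = 1:100),
    metrics = roc_res,
    control = control_grid(parallel_over = "everything")
  )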

Be careful to avoid use of variables from the global environment. For example:

num_pcs <- 3

recipe(mpg ~ ., data = mtcars) %>% 
  # Bad since num_pcs might not be found by a worker process
  step_pca(all_predictors(), num_comp = num_pcs)

recipe(mpg ~ ., data = mtcars) %>% 
  # Good since the value is injected into the object
  step_pca(all_predictors(), num_comp = !!num_pcs)

For the most part, the logging provided by tune_grid() will not be seen when running in parallel.

13.4.3 Benchmarking parallel processing with boosted trees

Three scenarios were benchmarked:

  1. Preprocess the data prior to modeling using dplyr

  2. Conduct the same preprocessing via a recipe

  3. With a recipe, add a step that has a high computational cost

These were run with varying numbers of worker processes and both parallel_over options, on a computer with 10 physical cores.

For dplyr and the simple recipe:

  • There is little difference in the execution times between the panels.

  • There is some benefit for using parallel_over = "everything" with many cores. However, as shown in the figure, the majority of the benefit of parallel processing occurs in the first five workers.

With the expensive preprocessing step, there is a considerable difference in execution times. Using parallel_over = "everything" is problematic since, even using all cores, it never achieves the execution time that parallel_over = "resamples" attains with just five cores. This is because the costly preprocessing step is unnecessarily repeated in the computational scheme.

Overall, note that the computational savings will vary from model to model and are also affected by the size of the grid, the number of resamples, etc. A very computationally efficient model may not benefit as much from parallel processing.

13.4.4 Racing Methods

The finetune package contains functions for racing.

One issue with grid search is that all models need to be fit across all resamples before any tuning parameters can be evaluated. It would be helpful if instead, at some point during tuning, an interim analysis could be conducted to eliminate any truly awful parameter candidates.

In racing methods the tuning process evaluates all models on an initial subset of resamples. Based on their current performance metrics, some parameter sets are not considered in subsequent resamples.

As an example, in the Chicago multilayer perceptron tuning process with a regular grid above, what would the results look like after only the first three folds?

We can fit a model where the outcome is the resampled area under the ROC curve and the predictor is an indicator for the parameter combination. The model takes the resample-to-resample effect into account and produces point and interval estimates for each parameter setting. The results of the model are one-sided 95% confidence intervals that measure the loss of the ROC value relative to the currently best performing parameters.

Any parameter set whose confidence interval includes zero would lack evidence that its performance is statistically different from the best results. We retain 10 settings; these are resampled more. The remaining 10 submodels are no longer considered.
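
To give a flavor of that interim analysis, here is a rough sketch of the idea (not the exact {finetune} implementation): a mixed model with a random intercept per resample compares candidates while accounting for the resample-to-resample effect. interim_results is a hypothetical data frame with one row per resample and candidate, holding columns resample, config, and roc_auc.

library(lme4)

# Random intercept per resample accounts for the resample-to-resample effect
interim_fit <- lmer(roc_auc ~ config + (1 | resample), data = interim_results)

# Interval estimates for each candidate's difference from the reference
# configuration; intervals like these drive the elimination decision
confint(interim_fit)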

(Video: animation of the racing process.)

Racing methods can be more efficient than basic grid search as long as the interim analysis is fast and some parameter settings have poor performance. It also is most helpful when the model does not have the ability to exploit submodel predictions.

The tune_race_anova() function uses an Analysis of Variance (ANOVA) model to test for statistical significance of the different model configurations.

library(finetune)

# mlp_wflow, Chicago_folds, mlp_param, and rmse_mape_rsq_iic were created in
# earlier sections of these notes
set.seed(99)
mlp_sfd_race <-
  mlp_wflow %>%
  tune_race_anova(
    Chicago_folds,
    grid = 20,
    param_info = mlp_param,
    metrics = rmse_mape_rsq_iic,
    control = control_race(verbose_elim = TRUE)
  )

write_rds(mlp_sfd_race, 
          "data/13-Chicago-mlp_sfd_race.rds",
          compress = "gz")
autoplot(mlp_sfd_race)

ggsave("images/13_mlp_sfd_race.png",
       width = 12)
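
The {finetune} package also provides plot_race() to visualize how candidates were eliminated as resampling proceeded; a quick look using the object from above:

plot_race(mlp_sfd_race)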

show_best(mlp_sfd_race, n = 6)
  hidden_units  penalty epochs num_comp .metric .estimator  mean     n
         <int>    <dbl>  <int>    <int> <chr>   <chr>      <dbl> <int>
1            6 3.08e- 5    126        3 rmse    standard    2.47     8
2            8 2.15e- 1    148        9 rmse    standard    2.48     8
3           10 9.52e- 3    157        3 rmse    standard    2.55     8
4            6 2.60e-10     84       12 rmse    standard    2.56     8
5            5 1.48e- 2     94        4 rmse    standard    2.57     8
6            4 7.08e- 1     98       14 rmse    standard    2.60     8
# ... with 2 more variables: std_err <dbl>, .config <chr>
Warning message:
No value of `metric` was given; metric 'rmse' will be used.