Modelado Tidy con R - Club de Lectura

13.4 Tools for efficient grid search

A few tricks:

13.4.1 Submodel optimization

Types of models where, from a single model fit, multiple tuning parameter values can be evaluated without refitting:

  • Partial Least Squares (no. of components to retain)

  • Boosting models (no. of boosting iterations, i.e. trees)

  • glmnet (predictions across the amount of regularization)

  • MARS adds a set of nonlinear features (number of terms to retain)

The tune package automatically applies this type of optimization whenever an applicable model is tuned. See also this vignette.

methods("multi_predict")
##  [1] multi_predict._C5.0*        multi_predict._earth*      
##  [3] multi_predict._elnet*       multi_predict._glmnetfit*  
##  [5] multi_predict._lognet*      multi_predict._multnet*    
##  [7] multi_predict._torch_mlp*   multi_predict._train.kknn* 
##  [9] multi_predict._xgb.Booster* multi_predict.default*     
## see '?methods' for accessing help and source code
parsnip:::multi_predict._C5.0 %>% 
  formals() %>% 
  names()
## [1] "object"   "new_data" "type"     "trees"    "..."

For example, if a C5.0 model is fit to the cell classification data, we can tune the number of boosting iterations (trees). With all other parameters set at their default values, we can rapidly evaluate candidate values from 1 to 100:

library(tidymodels)  # loads parsnip, rsample, tune, yardstick, and modeldata (for `cells`)

data(cells)

# Drop the column marking the original training/test split
cells <- cells %>% select(-case)

# 10-fold cross-validation
cell_folds <- vfold_cv(cells)

roc_res <- metric_set(roc_auc)

c5_spec <- 
  boost_tree(trees = tune()) %>% 
  set_engine("C5.0") %>% 
  set_mode("classification")

set.seed(2)
c5_tune <- c5_spec %>%
  tune_grid(
    class ~ .,
    resamples = cell_folds,
    grid = data.frame(trees = 1:100),
    metrics = roc_res
  )

Even though we did not explicitly request the submodel prediction trick, this optimization is automatically applied by parsnip.

autoplot(c5_tune)

ggsave("images/13_c5_submodel.png")

13.4.2 Parallel processing

Parallel backend packages currently include doFuture, doMC, doMPI, doParallel, doRedis, doRNG, doSNOW, and doAzureParallel.

In tune_*(), there are two approaches, often set in control_grid() or control_resamples():

  • parallel_over = "resamples", or

  • parallel_over = "everything", or

  • parallel_over = NULL (the default), which chooses "resamples" when there is more than one resample and otherwise "everything", to attempt to maximize core utilization.

Note that switching between parallel_over strategies is not guaranteed to use the same random number generation schemes. However, re-tuning a model using the same parallel_over strategy is guaranteed to be reproducible between runs.

To use these, first register a parallel backend.

On a shared server, never consume all of the cores.

# Count physical cores (hyperthreads excluded)
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
# Create a PSOCK cluster and register it as the parallel backend
cl <- makePSOCKcluster(all_cores)
doParallel::registerDoParallel(cl)
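With a backend registered, the parallel_over strategy described above can be set explicitly through the control object. A minimal sketch, reusing the C5.0 specification and resamples from the submodel example (the tuning call is otherwise unchanged):

# Parallelize over both resamples and parameter combinations
ctrl <- control_grid(parallel_over = "everything")

set.seed(2)
c5_tune_par <- c5_spec %>%
  tune_grid(
    class ~ .,
    resamples = cell_folds,
    grid = data.frame(trees = 1:100),
    metrics = roc_res,
    control = ctrl
  )

# Release the workers when finished
parallel::stopCluster(cl)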

Be careful to avoid use of variables from the global environment. For example:

num_pcs <- 3

recipe(mpg ~ ., data = mtcars) %>% 
  # Bad since num_pcs might not be found by a worker process
  step_pca(all_predictors(), num_comp = num_pcs)

recipe(mpg ~ ., data = mtcars) %>% 
  # Good since the value is injected into the object
  step_pca(all_predictors(), num_comp = !!num_pcs)

For the most part, the logging provided by tune_grid() will not be seen when running in parallel.

13.4.3 Benchmarking parallel processing with boosted trees

Three scenarios were benchmarked:

  1. Preprocess the data prior to modeling using dplyr

  2. Conduct the same preprocessing via a recipe

  3. With a recipe, add a step that has a high computational cost

Each scenario was run with varying numbers of worker processes and both parallel_over options, on a computer with 10 physical cores.

For dplyr and the simple recipe:

  • There is little difference in the execution times between the panels.

  • There is some benefit to using parallel_over = "everything" with many cores. However, as shown in the figure, the majority of the benefit of parallel processing occurs in the first five workers.

With the expensive preprocessing step, there is a considerable difference in execution times. Using parallel_over = "everything" is problematic since, even using all cores, it never achieves the execution time that parallel_over = "resamples" attains with just five cores. This is because the costly preprocessing step is unnecessarily repeated in the computational scheme.

Overall, note that the computational savings will vary from model to model and are also affected by the size of the grid, the number of resamples, etc. A very computationally efficient model may not benefit as much from parallel processing.

13.4.4 Racing Methods

The finetune package contains functions for racing.

One issue with grid search is that all models need to be fit across all resamples before any tuning parameters can be evaluated. It would be helpful if instead, at some point during tuning, an interim analysis could be conducted to eliminate any truly awful parameter candidates.

In racing methods the tuning process evaluates all models on an initial subset of resamples. Based on their current performance metrics, some parameter sets are not considered in subsequent resamples.

As an example, in the Chicago multilayer perceptron tuning process with the regular grid from earlier in this chapter, what would the results look like after only the first three folds?

We can fit a model where the outcome is the resampled area under the ROC curve and the predictor is an indicator for the parameter combination. The model takes the resample-to-resample effect into account and produces point and interval estimates for each parameter setting. The results of the model are one-sided 95% confidence intervals that measure the loss of the ROC value relative to the currently best performing parameters.

Any parameter set whose confidence interval includes zero would lack evidence that its performance is statistically different from the best results. We retain 10 settings; these are resampled more. The remaining 10 submodels are no longer considered.

Video: animation of the racing process (see the book).

Racing methods can be more efficient than basic grid search as long as the interim analysis is fast and some parameter settings have poor performance. They are most helpful when the model cannot exploit submodel predictions.

The tune_race_anova() function fits an Analysis of Variance (ANOVA) model to test whether the different model configurations differ significantly in performance.

library(finetune)

set.seed(99)
mlp_sfd_race <-
  mlp_wflow %>%
  tune_race_anova(
    Chicago_folds,
    grid = 20,                   # 20 candidate parameter combinations
    param_info = mlp_param,
    metrics = rmse_mape_rsq_iic,
    control = control_race(verbose_elim = TRUE)  # report eliminations as they happen
  )

write_rds(mlp_sfd_race, 
          "data/13-Chicago-mlp_sfd_race.rds",
          compress = "gz")
autoplot(mlp_sfd_race)

ggsave("images/13_mlp_sfd_race.png",
       width = 12)

show_best(mlp_sfd_race, n = 6)
  hidden_units  penalty epochs num_comp .metric .estimator  mean     n
         <int>    <dbl>  <int>    <int> <chr>   <chr>      <dbl> <int>
1            6 3.08e- 5    126        3 rmse    standard    2.47     8
2            8 2.15e- 1    148        9 rmse    standard    2.48     8
3           10 9.52e- 3    157        3 rmse    standard    2.55     8
4            6 2.60e-10     84       12 rmse    standard    2.56     8
5            5 1.48e- 2     94        4 rmse    standard    2.57     8
6            4 7.08e- 1     98       14 rmse    standard    2.60     8
# ... with 2 more variables: std_err <dbl>, .config <chr>
Warning message:
No value of `metric` was given; metric 'rmse' will be used.
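The elimination process can also be inspected visually; a minimal sketch using finetune's plot_race(), which draws one line per candidate and stops the line where that candidate was eliminated:

# Performance of each candidate across the resamples it survived
plot_race(mlp_sfd_race)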