• Tidy Modeling with R Book Club
  • Welcome
    • Book club meetings
  • 1 Software for modeling
    • 1.1 The pit of success
    • 1.2 Types of models
    • 1.3 Terminology
    • 1.4 The data analysis process
    • 1.5 The modeling process
    • 1.6 Meeting Videos
      • 1.6.1 Cohort 1
      • 1.6.2 Cohort 2
      • 1.6.3 Cohort 3
      • 1.6.4 Cohort 4
  • 2 A tidyverse primer
    • 2.1 Tidyverse design principles
    • 2.2 Design for Humans - Overview
    • 2.3 Design for Humans and the Tidyverse
    • 2.4 Reusing existing data structures
    • 2.5 Designed for the pipe
    • 2.6 Designed for Functional Programming
    • 2.7 Tibbles vs. Data Frames
    • 2.8 How to read and wrangle data
    • 2.9 Further Reading
    • 2.10 Meeting Videos
      • 2.10.1 Cohort 1
      • 2.10.2 Cohort 2
      • 2.10.3 Cohort 3
      • 2.10.4 Cohort 4
  • 3 A review of R modeling fundamentals
    • 3.1 R formula syntax
      • 3.1.1 Recap
    • 3.2 Inspecting and developing models
    • 3.3 More of {base} and {stats}
    • 3.4 Why Tidy Principles and {tidymodels}?
    • 3.5 Meeting Videos
      • 3.5.1 Cohort 1
      • 3.5.2 Cohort 2
      • 3.5.3 Cohort 3
      • 3.5.4 Cohort 4
  • Basics
  • 4 The Ames housing data
    • 4.1 Pittsburgh: a parallel real world example
    • 4.2 Meeting Videos
      • 4.2.1 Cohort 1
      • 4.2.2 Cohort 2
      • 4.2.3 Cohort 3
      • 4.2.4 Cohort 4
  • 5 Spending our data
    • 5.1 Spending our data
    • 5.2 Common methods for splitting data
    • 5.3 Class imbalance
      • 5.3.1 Stratified sampling simulation
    • 5.4 Continuous outcome data
    • 5.5 Time series data
    • 5.6 Multi-level data
    • 5.7 What proportion should be used?
    • 5.8 Summary
      • 5.8.1 References
    • 5.9 Meeting Videos
      • 5.9.1 Cohort 1
      • 5.9.2 Cohort 2
      • 5.9.3 Cohort 3
      • 5.9.4 Cohort 4
  • 6 Fitting models with parsnip
    • 6.1 Create a Model
      • 6.1.1 Different Model Interfaces
      • 6.1.2 Model Specification
      • 6.1.3 Model Fitting
      • 6.1.4 Generalized Model Arguments
    • 6.2 Use Model results
    • 6.3 Make Predictions
    • 6.4 {tidymodels}-Adjacent Packages
    • 6.5 Summary
    • 6.6 Meeting Videos
      • 6.6.1 Cohort 1
      • 6.6.2 Cohort 2
      • 6.6.3 Cohort 3
      • 6.6.4 Cohort 4
  • AI Ethics
    • 6.7 Meeting Videos
      • 6.7.1 Cohort 2
  • 7 A model workflow
    • 7.1 Workflows
    • 7.2 Demonstration
      • 7.2.1 Some data exploration and cleaning
    • 7.3 Modeling with workflows
      • 7.3.1 Different model, same recipe
      • 7.3.2 Same model, different preprocessing
    • 7.4 Managing many workflows
    • 7.5 Notes
    • 7.6 Meeting Videos
      • 7.6.1 Cohort 1
      • 7.6.2 Cohort 2
      • 7.6.3 Cohort 3
      • 7.6.4 Cohort 4
  • 8 Feature engineering with recipes
    • 8.1 In summary
      • 8.1.1 Introduction
    • 8.2 A simple recipe for the Ames housing data
    • 8.3 Using recipes
    • 8.4 How data are used by the recipe
      • 8.4.1 Examples of recipe steps
    • 8.5 Skipping steps for new data
    • 8.6 Tidy a recipe
    • 8.7 Column roles
    • 8.8 Resources
    • 8.9 Meeting Videos
      • 8.9.1 Cohort 1
      • 8.9.2 Cohort 2
      • 8.9.3 Cohort 3
      • 8.9.4 Cohort 4
  • 9 Judging model effectiveness
    • 9.1 Performance Metrics and Inference
    • 9.2 A little Recap of previous chapters
      • 9.2.1 Case Study 1
      • 9.2.2 Tidymodels: modeling as a step-by-step process
      • 9.2.3 Workflow: to combine models and recipes
      • 9.2.4 Case Study 2
    • 9.3 Functions used to measure predictive strengths of a model
      • 9.3.1 Case Study 3
      • 9.3.2 Conclusion
    • 9.4 Measures of Model Fit - Case Study (Cohort 1)
    • 9.5 Disclaimers
    • 9.6 Regression Metrics
    • 9.7 Binary Classification Metrics
    • 9.8 References
    • 9.9 Meeting Videos
      • 9.9.1 Cohort 1
      • 9.9.2 Cohort 2
      • 9.9.3 Cohort 3
      • 9.9.4 Cohort 4
  • Review of chapters 4-9
    • 9.10 Meeting Videos
      • 9.10.1 Cohort 1
  • Tools for Creating Effective Models
  • 10 Resampling for evaluating performance
    • 10.1 Why?
    • 10.2 Resubstitution approach
    • 10.3 Resampling methods
      • 10.3.1 Cross-validation
      • 10.3.2 Validation sets
      • 10.3.3 Bootstrapping
      • 10.3.4 Rolling forecasting origin resampling
    • 10.4 Estimating performance
    • 10.5 Parallel processing
    • 10.6 Saving the resampled objects
    • 10.7 Meeting Videos
      • 10.7.1 Cohort 1
      • 10.7.2 Cohort 2
      • 10.7.3 Cohort 3
      • 10.7.4 Cohort 4
  • 11 Comparing models with resampling
    • 11.1 Calculate performance statistics
    • 11.2 Calculate performance statistics: {workflowsets}
    • 11.3 Within-resample correlation
    • 11.4 Practical effect size
    • 11.5 Simple Comparison
    • 11.6 Bayesian methods
    • 11.7 Meeting Videos
      • 11.7.1 Cohort 1
      • 11.7.2 Cohort 2
      • 11.7.3 Cohort 3
      • 11.7.4 Cohort 4
  • 12 Model tuning and the dangers of overfitting
    • 12.1 What is a Tuning Parameter?
      • 12.1.1 Examples
    • 12.2 When not to tune
    • 12.3 Decisions, Decisions…
    • 12.4 What Metric Should We Use?
    • 12.5 Can we make our model too good?
    • 12.6 Tuning Parameter Optimization Strategies
    • 12.7 Tuning Parameters in tidymodels {dials}
    • 12.8 Let’s try an example:
    • 12.9 Build our random forest model:
    • 12.10 Add tuning parameters:
    • 12.11 Updating tuning parameters:
    • 12.12 Finalizing tuning parameters:
    • 12.13 What is next?
    • 12.14 Meeting Videos
      • 12.14.1 Cohort 1
      • 12.14.2 Cohort 2
      • 12.14.3 Cohort 3
      • 12.14.4 Cohort 4
  • 13 Grid search
    • 13.1 Regular and non-regular grids
      • 13.1.1 Regular Grids
      • 13.1.2 Irregular Grids
    • 13.2 Evaluating the grid
    • 13.3 Finalizing the model
    • 13.4 Tools for efficient grid search
      • 13.4.1 Submodel optimization
      • 13.4.2 Parallel processing
      • 13.4.3 Benchmarking parallel processing with boosted trees
      • 13.4.4 Racing Methods
    • 13.5 Chapter Summary
    • 13.6 Meeting Videos
      • 13.6.1 Cohort 1
      • 13.6.2 Cohort 2
      • 13.6.3 Cohort 3
      • 13.6.4 Cohort 4
  • 14 Iterative search
    • 14.1 SVM model as motivating example
    • 14.2 Bayesian Optimization
      • 14.2.1 Gaussian process model, at a high level
    • 14.3 Simulated annealing
      • 14.3.1 How it works
      • 14.3.2 The tune_sim_anneal() function
    • 14.4 References
    • 14.5 Meeting Videos
      • 14.5.1 Cohort 1
      • 14.5.2 Cohort 2
      • 14.5.3 Cohort 3
      • 14.5.4 Cohort 4
  • 15 Screening Many Models
    • 15.1 Obligatory Setup
    • 15.2 Creating workflow_sets
    • 15.3 Ranking models
    • 15.4 Finalizing the model
    • 15.5 Meeting Videos
      • 15.5.1 Cohort 1
      • 15.5.2 Cohort 2
      • 15.5.3 Cohort 3
      • 15.5.4 Cohort 4
  • Review of chapters 10-15
    • 15.6 Meeting Videos
      • 15.6.1 Cohort 1
  • 16 Dimensionality reduction
    • 16.1 {recipes} without {workflows}
    • 16.2 Why do dimensionality reduction?
    • 16.3 Introducing the beans dataset
    • 16.4 Prepare the beans data using recipes
    • 16.5 Principal Component Analysis (PCA)
    • 16.6 Partial Least Squares (PLS)
    • 16.7 Independent Component Analysis (ICA)
    • 16.8 Uniform Manifold Approximation and Projection (UMAP)
    • 16.9 Modeling
    • 16.10 Meeting Videos
      • 16.10.1 Cohort 1
      • 16.10.2 Cohort 3
      • 16.10.3 Cohort 4
  • Other Topics
  • 17 Encoding categorical data
    • 17.1 Effect of Encoding
    • 17.2 Encoding methods:
      • 17.2.1 Cohort 4
  • 18 Explaining models and predictions
    • 18.1 Chapter 18 Setup
    • 18.2 Overview
    • 18.3 Local Explanations
    • 18.4 Local Explanations for Interactions
    • 18.5 Global Explanations
    • 18.6 Global Explanations from Local Explanations
    • 18.7 References
    • 18.8 Meeting Videos
      • 18.8.1 Cohort 1
      • 18.8.2 Cohort 3
      • 18.8.3 Cohort 4
  • 19 When should you trust predictions?
    • 19.1 Equivocal Results
    • 19.2 Model Applicability
    • 19.3 Meeting Videos
      • 19.3.1 Cohort 1
      • 19.3.2 Cohort 3
      • 19.3.3 Cohort 4
  • 20 Ensembles of models
    • 20.1 Ensembling
    • 20.2 Ensembling with stacks!
    • 20.3 Define some models
    • 20.4 Initialize and add members to stack.
    • 20.5 Blend, fit, predict
    • 20.6 Case Study - Patient Risk Profiles
      • 20.6.1 Loading necessary libraries
      • 20.6.2 Loading data
      • 20.6.3 Have a look at the data
      • 20.6.4 Data wrangling
      • 20.6.5 Data to use in the Model
      • 20.6.6 Spending Data
      • 20.6.7 Preprocessing and Recipes
      • 20.6.8 Make many models
      • 20.6.9 Build the Workflow
      • 20.6.10 Tuning: Grid and Race
      • 20.6.11 Stacking (Ensembles of models)
    • 20.7 Meeting Videos
      • 20.7.1 Cohort 1
      • 20.7.2 Cohort 3
      • 20.7.3 Cohort 4
  • 21 Inferential analysis
    • 21.1 Dataset used for demonstrating inference
    • 21.2 Tidy method from the {broom} package
    • 21.3 {infer} for simple, high level hypothesis testing
      • 21.3.1 p-value for independence based on simulation with permutation
      • 21.3.2 Confidence interval for correlation based on simulation with bootstrapping
      • 21.3.3 Use theory instead of simulation
      • 21.3.4 Linear models with multiple explanatory variables
    • 21.4 reg_intervals from {rsample}
    • 21.5 Inference with lower level helpers
    • 21.6 Meeting Videos
      • 21.6.1 Cohort 3
      • 21.6.2 Cohort 4

Tidy Modeling with R Book Club

Chapter 14 Iterative search

Learning objectives:

  • Use tune::tune_bayes() to optimize model parameters using Bayesian optimization (see the first sketch after this list).
    • Describe how a Gaussian process model can be applied to parameter optimization.
    • Explain how acquisition functions can be expressed as a trade-off between exploration and exploitation.
    • Describe expected improvement, the default acquisition function used by {tidymodels}.
  • Use finetune::tune_sim_anneal() to optimize model parameters using simulated annealing (see the second sketch after this list).
    • Describe simulated annealing search.
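
Below is a minimal sketch of the Bayesian optimization workflow with tune_bayes(). It is not the book's exact code: the cells data from {modeldata}, the preprocessing steps, and the settings initial = 5, iter = 25, and no_improve = 10 are illustrative assumptions chosen to keep the example small.

```r
library(tidymodels)  # loads {tune}, {parsnip}, {recipes}, {rsample}, {workflows}, ...

# Illustrative data: the cells dataset from {modeldata}
data(cells, package = "modeldata")
cells <- cells %>% select(-case)

set.seed(1234)
cell_folds <- vfold_cv(cells, v = 5)

# Assumed preprocessing and model specification for this sketch
svm_rec <- recipe(class ~ ., data = cells) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_wflow <- workflow() %>%
  add_recipe(svm_rec) %>%
  add_model(svm_spec)

set.seed(1403)
svm_bayes <- svm_wflow %>%
  tune_bayes(
    resamples = cell_folds,
    metrics = metric_set(roc_auc),
    initial = 5,   # small initial design used to seed the Gaussian process model
    iter = 25,     # maximum number of Bayesian optimization iterations
    # objective = exp_improve() is the default acquisition function (expected improvement)
    control = control_bayes(no_improve = 10, verbose = TRUE)
  )

show_best(svm_bayes, metric = "roc_auc")
```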
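A similarly hedged sketch for simulated annealing with finetune::tune_sim_anneal(), reusing svm_wflow and cell_folds from the sketch above; iter = 50 and no_improve = 10 are illustrative settings rather than the book's values.

```r
library(finetune)  # provides tune_sim_anneal() and control_sim_anneal()

set.seed(1404)
svm_sa <- svm_wflow %>%
  tune_sim_anneal(
    resamples = cell_folds,
    metrics = metric_set(roc_auc),
    initial = 1,   # number of initial grid results used to start the search
    iter = 50,     # number of simulated annealing iterations
    control = control_sim_anneal(
      no_improve = 10,  # stop early if no better result appears within 10 iterations
      verbose = TRUE
    )
  )

show_best(svm_sa, metric = "roc_auc")
```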