Tidy Modeling with R Book Club
12.2 When not to tune
Prior distribution (Bayesian analysis)
Number of Trees (Random Forest and Bagging)
Does not need tuning; instead, use enough trees for the results to be stable
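
The point about the number of trees can be illustrated with a small simulation. The sketch below is not from the book or from any tidymodels package; it is a toy, standard-library Python example under stated assumptions: the data are simulated, a 1-nearest-neighbour predictor stands in for a deep tree as a high-variance base learner, and all function names are hypothetical. It refits a bagged ensemble several times on the same data and measures how much the prediction at one point moves from run to run. Adding members should shrink that run-to-run spread, which is why the ensemble size is usually set "large enough" rather than tuned.

```python
import random
import statistics


def make_data(n, rng):
    """Simulated regression data: y = 2x + Gaussian noise (illustrative only)."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 0.5)) for x in xs]


def one_nn_predict(sample, x0):
    """High-variance base learner (stand-in for a deep tree): 1-nearest neighbour."""
    return min(sample, key=lambda point: abs(point[0] - x0))[1]


def bagged_predict(data, x0, n_members, rng):
    """Average the base learner over n_members bootstrap resamples of the data."""
    preds = []
    for _ in range(n_members):
        boot = rng.choices(data, k=len(data))  # sample rows with replacement
        preds.append(one_nn_predict(boot, x0))
    return statistics.fmean(preds)


def run_to_run_sd(data, x0, n_members, n_repeats, seed):
    """Standard deviation of the bagged prediction across repeated fits."""
    rng = random.Random(seed)
    preds = [bagged_predict(data, x0, n_members, rng) for _ in range(n_repeats)]
    return statistics.stdev(preds)


data = make_data(100, random.Random(1))
spread = {
    b: run_to_run_sd(data, x0=0.5, n_members=b, n_repeats=30, seed=2)
    for b in (5, 50, 500)
}
for b, sd in spread.items():
    print(f"members={b:4d}  run-to-run sd of prediction={sd:.4f}")
```

In this sketch the extra members only average away resampling noise; they do not change what the ensemble is estimating. So the practical question is how many members are enough for stable results, not which count performs best.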