9.2 A little Recap of previous chapters
A useful model would include:
- parameter estimation
- model selection and tuning
- performance assessment
9.2.1 Case Study 1
The construction of a model implies:
check for correlation
; to see if predictors influence each other, and in what way. For example, the predictors correlation can influence the estimation of the outcome, this can be identified when the estimated values change sign of have very different values, if a modification in association takes place.
Here we use the crickets
data from modeldata
package:
data(crickets, package = "modeldata")
%>% head crickets
## # A tibble: 6 × 3
## species temp rate
## <fct> <dbl> <dbl>
## 1 O. exclamationis 20.8 67.9
## 2 O. exclamationis 20.8 65.1
## 3 O. exclamationis 24 77.3
## 4 O. exclamationis 24 78.7
## 5 O. exclamationis 24 79.4
## 6 O. exclamationis 24 80.4
In this case we check for the correlation between the response and the predictor:
cor(crickets$rate,crickets$temp)
## [1] 0.9591189
While if we look at the correlation between predictors, which in this case one of the predictor is categorical, we need to make a transformation of the categorical predictor in order to be able to calculate the correlation.
%>%count(species) crickets
## # A tibble: 2 × 2
## species n
## <fct> <int>
## 1 O. exclamationis 14
## 2 O. niveus 17
<- crickets%>%
crickets1 mutate(species=ifelse(species=="O. exclamationis",1,0))
%>%
crickets1 head
## # A tibble: 6 × 3
## species temp rate
## <dbl> <dbl> <dbl>
## 1 1 20.8 67.9
## 2 1 20.8 65.1
## 3 1 24 77.3
## 4 1 24 78.7
## 5 1 24 79.4
## 6 1 24 80.4
cor(crickets1$species,crickets1$temp)
## [1] 0.4807908
construction of the model
; sample data are used to make a simple linear model with the functionlm()
, and a comparison of the two models is done withanova()
function.
A little EDA, exploratory data analysis is done to identfy the relationship between response and predictors.
%>%
crickets ggplot(aes(temp,rate,color=species))+
geom_point()+
geom_smooth(se=F)+
theme_linedraw()
<- lm(rate ~ (temp + species)^2, data = crickets)
interaction_fit
<- lm(rate ~ temp + species, data = crickets)
main_effect_fit # Compare the two: standard two-way analysis of variance
anova(main_effect_fit, interaction_fit)
## Analysis of Variance Table
##
## Model 1: rate ~ temp + species
## Model 2: rate ~ (temp + species)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 89.350
## 2 27 85.074 1 4.2758 1.357 0.2542
estimation of the model
; obtained using thepredict()
function; it conducts specific calculations after the fitted model is created, in this case we test it against new data.
<- data.frame(species = "O. exclamationis",
new_values temp = 15:20)
<- predict(main_effect_fit, new_values)
pred
<- bind_cols(new_values,rate=pred)
prediction_values
bind_rows(crickets,prediction_values)%>%
ggplot(aes(temp,rate,color=species))+
geom_point()+
geom_smooth(se=F)+
theme_linedraw()
9.2.2 Tidymodels: modeling as a step by step mode
The strategy of Tidymodels is to decide for a model to use on a step-by-step mode, this is the main difference than just using lm()
or glm()
functions and then predict()
.
When valuing a model, the fundamental is to have a clear view of the data and the type of manipulation to obtain the answer to our questions
The first step is to evaluate the structure of the data, it needs to be balanced for deciding a smart strategy for allocating data.
Smart Strategy for allocating data:
- allocate specific subsets of data for different tasks
- allocate the largest possible amount to the model parameter estimation only
The second is to split the data into two main parts as train and test sets, sometimes a third set is valued such as the validation set.
Splitting data:
training set (the substrate to develop the model and estimate the parameters)
validation set (a small set of data to measure performance as the network was trained)
test set (the final (unbiased) arbiter to determine the efficacy of the model)
data(ames, package = "modeldata")
set.seed(123)
<- initial_split(ames, prop = 0.80)
ames_split <- training(ames_split)
ames_train <- testing(ames_split) ames_test
The step-by-step model decomposes the modeling procedure into a certain number of customized steps.
Preprocessing model functions:
- recipe()
- prep()
- bake()
Recipe objects for feature engineering and data preprocessing prior to modeling.
Transformations and encoding of the data with recipes::recipe()
help choosing the option which is the most associated with the outcome, through data preprocessing techniques applying it for different models.
recipe()
defines the formula and allows for the preprocessing steps of a model with the help of the step_**
functions
The matrix data transformation is obtained from a data frame through the modeling procedure of preprocessing data to obtain a new designed matrix.
9.2.3 Workflow: to combine models and recipes
Once the preprocessing steps are done, data are allocated to test and training sets, and the model engine is set, the next step involves wrapping everything up into a workflow to have all the steps together.
9.2.4 Case Study 2
NYC flights data modeling
Let’s see this in practice with an example taken from: https://www.tidymodels.org/start/recipes/
Data are from {nycflights13} package:
library(nycflights13)
# tidy data and manipulation
set.seed(123)
<-
flight_data %>%
flights # data modification
mutate(arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
arr_delay = factor(arr_delay),
date = as.Date(time_hour)) %>%
inner_join(weather, by = c("origin", "time_hour")) %>%
select(dep_time, flight, origin, dest, air_time, distance,
%>%
carrier, date, arr_delay, time_hour) na.omit() %>%
mutate_if(is.character, as.factor)
%>% head flight_data
## # A tibble: 6 × 10
## dep_time flight origin dest air_time distance carrier date arr_delay
## <int> <int> <fct> <fct> <dbl> <dbl> <fct> <date> <fct>
## 1 517 1545 EWR IAH 227 1400 UA 2013-01-01 on_time
## 2 533 1714 LGA IAH 227 1416 UA 2013-01-01 on_time
## 3 542 1141 JFK MIA 160 1089 AA 2013-01-01 late
## 4 544 725 JFK BQN 183 1576 B6 2013-01-01 on_time
## 5 554 461 LGA ATL 116 762 DL 2013-01-01 on_time
## 6 554 1696 EWR ORD 150 719 UA 2013-01-01 on_time
## # ℹ 1 more variable: time_hour <dttm>
%>%
flight_dataggplot(aes(arr_delay))+
geom_bar()
Spending our data:
## split
set.seed(555)
<- initial_split(flight_data, prop = 3/4)
data_split <- training(data_split)
train_data <- testing(data_split) test_data
Apply the recipe:
<-
flights_rec recipe(arr_delay ~ ., data = train_data) %>%
update_role(flight, time_hour, new_role = "ID") %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# remove columns from the data when the training
# set data have a single value
step_zv(all_predictors())
%>%
flights_rec prep() %>%
bake(new_data = NULL) %>%
head
## # A tibble: 6 × 161
## dep_time flight air_time distance time_hour arr_delay
## <int> <int> <dbl> <dbl> <dttm> <fct>
## 1 656 2116 39 184 2013-02-20 07:00:00 on_time
## 2 2030 1465 339 2586 2013-07-25 19:00:00 late
## 3 1712 3448 92 589 2013-08-06 17:00:00 on_time
## 4 853 4087 96 645 2013-09-22 08:00:00 on_time
## 5 1709 652 39 200 2013-09-28 17:00:00 on_time
## 6 2257 608 46 273 2013-01-22 22:00:00 on_time
## # ℹ 155 more variables: date_USChristmasDay <int>, date_USColumbusDay <int>,
## # date_USCPulaskisBirthday <int>, date_USDecorationMemorialDay <int>,
## # date_USElectionDay <int>, date_USGoodFriday <int>,
## # date_USInaugurationDay <int>, date_USIndependenceDay <int>,
## # date_USJuneteenthNationalIndependenceDay <int>, date_USLaborDay <int>,
## # date_USLincolnsBirthday <int>, date_USMemorialDay <int>,
## # date_USMLKingsBirthday <int>, date_USNewYearsDay <int>, …
Fit a model with a recipe and apply the workflow:
- process the recipe using the training set
- apply the recipe to the training set
- apply the recipe to the test set
<-
lr_mod logistic_reg() %>%
set_engine("glm")
Workflow: to simplify the process a parsnip object pairs a model and recipe together
<-
flights_wflow workflow() %>%
add_model(lr_mod) %>%
add_recipe(flights_rec)
flights_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_date()
## • step_holiday()
## • step_rm()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Fit: The function to prepare the recipe and train the model from the resulting predictors
# this takes a bit
<-
flights_fit %>%
flights_wflow # prepare the recipe and train the model
fit(data = train_data)
Extract the model or recipe objects from the workflow:
%>%
flights_fit # pull_workflow_fit() %>%
tidy() %>%
head
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.26 2.72 1.93 5.36e- 2
## 2 dep_time -0.00167 0.0000141 -118. 0
## 3 air_time -0.0438 0.000561 -78.1 0
## 4 distance 0.00615 0.00150 4.10 4.12e- 5
## 5 date_USChristmasDay 1.14 0.171 6.70 2.12e-11
## 6 date_USColumbusDay 0.626 0.169 3.71 2.10e- 4
Use a trained workflow to predict following these steps:
- build the model (lr_mod),
- create a preprocessing recipe (flights_rec),
- bundle the model and recipe (flights_wflow), and
- train the workflow using a single call to fit()
Then use the trained workflow to predict using the test data or any other new data:
<-
flights_pred predict(flights_fit, test_data, type = "prob") %>%
bind_cols(test_data %>%
select(arr_delay, time_hour, flight))
%>%head flights_pred
## # A tibble: 6 × 5
## .pred_late .pred_on_time arr_delay time_hour flight
## <dbl> <dbl> <fct> <dttm> <int>
## 1 0.0183 0.982 on_time 2013-01-01 06:00:00 461
## 2 0.0426 0.957 on_time 2013-01-01 06:00:00 5708
## 3 0.0413 0.959 on_time 2013-01-01 06:00:00 71
## 4 0.0253 0.975 on_time 2013-01-01 06:00:00 194
## 5 0.0306 0.969 on_time 2013-01-01 06:00:00 1743
## 6 0.0236 0.976 on_time 2013-01-01 06:00:00 1077
Figure 9.3 Graph of the workflow to predictHow will we evaluate the performance of our workflow?
To finally answer our question we need to check the area under the ROC curve.
What is a ROC curve?
It is a curve that identifies the area of credibility of our model. To calculate the curve we use two functions: roc_curve()
and roc_auc()
, the curve and the area under the curve, respectively.
The ROC curve uses the class probability estimates to give us a sense of performance across the entire set of potential probability cutoffs.
Once the predicted class of probabilities are obtained we can generate a ROC curve. Let’s see it applied in our NYC flights case study. We need late and on_time variable predictors to create the curve and apply the autoplot()
method as shown below:
%>%
flights_pred roc_curve(truth = arr_delay, .pred_late)%>%
ggplot(aes(x=1-specificity,y=sensitivity))+
geom_abline(linetype="dashed")+
geom_line()+
coord_equal()
%>%
flights_pred roc_curve(truth = arr_delay, .pred_late) %>%
autoplot()
%>%
flights_pred roc_auc(truth = arr_delay, .pred_late)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.761
In particular, below are shown the steps for different cases, and the visualization used to evaluate the credibility of a model.