9.3 Functions used to measure predictive strengths of a model
Models are assessed through empirical validation, and the metrics are grouped by the nature of the outcome data:
Regression:
- regression metrics (purely numeric)
Classification:
- binary classes
- multilevel metrics (three or more class levels)
9.3.1 Case Study 3
For this example the data come from The Trust for Public Land, which ranks public parks in the US. The dataset covers the period 2012 to 2020 and 102 US cities, and parks are ranked by the characteristics of their services.
In particular we will be looking at selected amenities in the parks, i.e. features that contribute to comfort, convenience, or enjoyment.
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-06-22/readme.md
Data load
library(tidytuesdayR)
library(tidymodels)  # loads dplyr, ggplot2, tidyr, recipes, yardstick, ...
library(viridis)     # for scale_color_viridis()
tuesdata <- tidytuesdayR::tt_load(2021, week = 26)
##
## Downloading file 1 of 1: `parks.csv`
parks <- tuesdata$parks
parks %>% head()
## # A tibble: 6 × 28
## year rank city med_park_size_data med_park_size_points park_pct_city_data
## <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 2020 1 Minnea… 5.7 26 15%
## 2 2020 2 Washin… 1.4 5 24%
## 3 2020 3 St. Pa… 3.2 14 15%
## 4 2020 4 Arling… 2.4 10 11%
## 5 2020 5 Cincin… 4.4 20 14%
## 6 2020 6 Portla… 4.9 22 18%
## # ℹ 22 more variables: park_pct_city_points <dbl>, pct_near_park_data <chr>,
## # pct_near_park_points <dbl>, spend_per_resident_data <chr>,
## # spend_per_resident_points <dbl>, basketball_data <dbl>,
## # basketball_points <dbl>, dogpark_data <dbl>, dogpark_points <dbl>,
## # playground_data <dbl>, playground_points <dbl>, rec_sr_data <dbl>,
## # rec_sr_points <dbl>, restroom_data <dbl>, restroom_points <dbl>,
## # splashground_data <dbl>, splashground_points <dbl>, …
parks %>% names()
## [1] "year" "rank"
## [3] "city" "med_park_size_data"
## [5] "med_park_size_points" "park_pct_city_data"
## [7] "park_pct_city_points" "pct_near_park_data"
## [9] "pct_near_park_points" "spend_per_resident_data"
## [11] "spend_per_resident_points" "basketball_data"
## [13] "basketball_points" "dogpark_data"
## [15] "dogpark_points" "playground_data"
## [17] "playground_points" "rec_sr_data"
## [19] "rec_sr_points" "restroom_data"
## [21] "restroom_points" "splashground_data"
## [23] "splashground_points" "amenities_points"
## [25] "total_points" "total_pct"
## [27] "city_dup" "park_benches"
EDA: Exploratory data analysis
ggplot(parks,aes(x=year,y = rank))+
geom_col(aes(fill=city)) +
labs(x="Year", y = "Rank values")+
guides(fill="none")+
labs(title="US City Ranks per Year")+
theme_minimal()
Keep the three years 2018, 2019, and 2020 and the 99 cities with full information.
parks_long <- parks %>%
  select(-amenities_points, -total_points,
         -contains("_data"),
         -park_benches, -city_dup) %>%
  drop_na() %>%
  pivot_longer(
    cols = contains("_points"),
    names_to = "amenities",
    values_to = "points"
  ) %>%
  mutate(amenities = gsub("_points", "", amenities))
parks_long %>% head()
## # A tibble: 6 × 6
## year rank city total_pct amenities points
## <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 2020 1 Minneapolis 85.3 med_park_size 26
## 2 2020 1 Minneapolis 85.3 park_pct_city 38
## 3 2020 1 Minneapolis 85.3 pct_near_park 98
## 4 2020 1 Minneapolis 85.3 spend_per_resident 100
## 5 2020 1 Minneapolis 85.3 basketball 47
## 6 2020 1 Minneapolis 85.3 dogpark 65
parks_long %>%
  ggplot(aes(x = total_pct, y = rank, group = year, color = factor(year))) +
  geom_point(size = 0.5, alpha = 0.7) +
  geom_smooth(aes(color = factor(year)), linewidth = 0.3, se = FALSE) +
scale_y_reverse()+
scale_color_viridis(discrete = TRUE) +
labs(title = "Amenities points",color="Year") +
theme_minimal()
Let's pose some questions before choosing a model:
What are we going to predict?
What is our research question?
Data Split
set.seed(123)
parks_split <- initial_split(parks_long, strata = rank, prop = 0.80)
parks_train <- training(parks_split)
parks_test  <- testing(parks_split)
Preprocessing steps: recipe() %>% step_*()
This step sets the model formula and, optionally, applies some data preprocessing with the help of the step_*() functions.
In this case we don't make any extra manipulation in this first version of the model.
parks_rec <-
  recipe(rank ~ ., data = parks_train)
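For illustration only, extra preprocessing could be chained onto the recipe with step_*() functions. The sketch below is a hypothetical variant (parks_rec_ext is not used in the model that follows); step_dummy() and step_normalize() are standard recipes steps.
# hypothetical extended recipe (not used below):
# dummy-code nominal predictors and normalize numeric ones
parks_rec_ext <-
  recipe(rank ~ ., data = parks_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())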
Set the Workflow
Wrap everything into a workflow.
# set model engine
lm_model <- linear_reg() %>% set_engine("lm")

# use a workflow
lm_wflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(parks_rec)
Fit the workflow with the training set.
lm_fit <- fit(lm_wflow, parks_train)

lm_fit %>% tidy() %>% head()
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2162. 88.0 -24.6 3.71e-118
## 2 year 1.14 0.0437 26.1 6.81e-131
## 3 cityAnaheim 1.62 0.477 3.39 7.06e- 4
## 4 cityAnchorage -1.08 0.438 -2.46 1.40e- 2
## 5 cityArlington, Texas 3.21 0.504 6.38 2.22e- 10
## 6 cityArlington, Virginia 9.39 0.621 15.1 2.53e- 49
Predict with new data from the testing set.
# predict(lm_fit, parks_test %>% slice(1:3))
# test the model on new data
pred <- predict(lm_fit,
                new_data = parks_test %>%
                  filter(city %in% c("Seattle", "Atlanta", "Baltimore")))

parks_test_res <- predict(lm_fit,
                          new_data = parks_test %>% select(-rank)) %>%
  bind_cols(parks_test %>% select(rank))
parks_test_res %>% head()
## # A tibble: 6 × 2
## .pred rank
## <dbl> <dbl>
## 1 0.0862 1
## 2 0.180 1
## 3 0.187 1
## 4 2.66 2
## 5 3.76 3
## 6 3.90 3
ggplot(parks_test_res, aes(x = rank, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Rank", x = "Rank") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
lm_fit %>%
  augment(new_data = parks_test %>%
            filter(city %in% c("Seattle", "Atlanta", "Baltimore"))) %>%
  group_by(city) %>%
  reframe(rank = mean(rank), .pred = mean(.pred))
## # A tibble: 3 × 3
## city rank .pred
## <chr> <dbl> <dbl>
## 1 Atlanta 41.2 41.9
## 2 Baltimore 60 60.0
## 3 Seattle 11 10.7
ggplot() +
  geom_point(data = parks_long,
             aes(x = total_pct, y = rank),
             color = "grey0",
             size = 0.5,
             alpha = 0.7) +
  scale_y_reverse() +
  geom_smooth(data = lm_fit %>%
                augment(new_data = parks_test),
              aes(total_pct, .pred),
              color = "darkred",
              linewidth = 0.5,
              alpha = 0.7)
- Apply the root mean squared error: rmse()
The first measure used for the model is the root mean squared error (RMSE).
# rmse(data, truth = outcome, estimate = .pred)
rmse(parks_test_res, truth = rank, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.57
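As a sanity check (a minimal sketch, not part of the original workflow), RMSE is the square root of the mean squared difference between observed and predicted values, so it can be reproduced by hand:
# by-hand RMSE, should match the yardstick value above
sqrt(mean((parks_test_res$rank - parks_test_res$.pred)^2))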
Then make a comparison by computing multiple metrics at once with metric_set():
# data_metrics <- metric_set(rmse, rsq, mae)
# data_metrics(data_test_res, truth = outcome, estimate = .pred)
<- metric_set(rmse, rsq, mae)
parks_metrics parks_metrics(parks_test_res, truth = rank, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.57
## 2 rsq standard 0.997
## 3 mae standard 1.21
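For reference (a minimal sketch, assuming the standard definitions used by yardstick), mae() is the mean absolute error and rsq() is the squared correlation between observed and predicted values:
# by-hand equivalents of mae() and rsq()
mean(abs(parks_test_res$rank - parks_test_res$.pred))  # mae
cor(parks_test_res$rank, parks_test_res$.pred)^2       # rsq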
Here we use some examples from the book with sample predictions and multiple resamples:
Binary classes
- Confusion matrix:
A confusion matrix gives a holistic view of your model's performance.
What is a Confusion Matrix?
It is a matrix that contains values such as:
- True Positive (TP)
- True Negative (TN)
- False Positive – Type 1 Error (FP)
- False Negative – Type 2 Error (FN)
Example 1: two_class_example
data("two_class_example")
conf_mat(two_class_example, truth = truth, estimate = predicted)
## Truth
## Prediction Class1 Class2
## Class1 227 50
## Class2 31 192
The confusion matrix also contains the information needed for other metrics, which can be derived from its cells (see the yardstick calls below):
- Precision is how certain you are of your true positives.
- Recall is how certain you are that you are not missing any positives.
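Both can be computed directly on the same example; a minimal sketch using yardstick's precision() and recall():
# precision = TP / (TP + FP), recall (sensitivity) = TP / (TP + FN)
precision(two_class_example, truth, predicted)
recall(two_class_example, truth, predicted)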
A common measure of effectiveness is the overall accuracy. It uses the hard class predictions (the classes obtained after applying a probability cutoff) to measure how often the model predicted the correct class.
- Accuracy:
accuracy(two_class_example, truth = truth, estimate = predicted)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.838
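Accuracy can be checked by hand from the confusion matrix above as the proportion of correct predictions (a quick sketch using the cell counts):
# (TP + TN) / total
(227 + 192) / (227 + 50 + 31 + 192)  # = 0.838, matching accuracy() above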
- Matthews correlation coefficient:
mcc(two_class_example, truth, predicted)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mcc binary 0.677
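For the binary case the Matthews correlation coefficient has a closed form in terms of the confusion matrix cells; a by-hand check with the counts above (a minimal sketch):
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
(227 * 192 - 50 * 31) /
  sqrt((227 + 50) * (227 + 31) * (192 + 50) * (192 + 31))
# ~0.677, matching mcc() above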
- F1 metric: the F1-score is the harmonic mean of precision and recall. It captures both trends in a single value: when we try to increase the precision of our model, the recall (a.k.a. sensitivity) goes down, and vice versa.
f_meas(two_class_example, truth, predicted) #,event_level = "second")
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.849
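Since the F1-score is the harmonic mean of precision and recall, it can also be checked directly from those two quantities (a minimal sketch using the confusion matrix counts):
# F1 = 2 * precision * recall / (precision + recall)
prec <- 227 / (227 + 50)  # TP / (TP + FP)
rec  <- 227 / (227 + 31)  # TP / (TP + FN)
2 * prec * rec / (prec + rec)  # ~0.849, matching f_meas() above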
All of the metrics above have an event_level argument to set whether the first or the second factor level is the event of interest.
To visualize the model's behavior across thresholds, the receiver operating characteristic (ROC) curve computes sensitivity and specificity over a continuum of event thresholds:
- roc_curve() (curve)
- roc_auc() (area)
two_class_curve <- roc_curve(two_class_example, truth, Class1)

two_class_curve %>% head()
## # A tibble: 6 × 3
## .threshold specificity sensitivity
## <dbl> <dbl> <dbl>
## 1 -Inf 0 1
## 2 1.79e-7 0 1
## 3 4.50e-6 0.00413 1
## 4 5.81e-6 0.00826 1
## 5 5.92e-6 0.0124 1
## 6 1.22e-5 0.0165 1
roc_auc(two_class_example, truth, Class1)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.939
autoplot(two_class_curve)
Multi-class
Finally, we look at data with three or more class levels.
- Accuracy:
data(hpc_cv)
hpc_cv %>% head()
## obs pred VF F M L Resample
## 1 VF VF 0.9136340 0.07786694 0.008479147 1.991225e-05 Fold01
## 2 VF VF 0.9380672 0.05710623 0.004816447 1.011557e-05 Fold01
## 3 VF VF 0.9473710 0.04946767 0.003156287 4.999849e-06 Fold01
## 4 VF VF 0.9289077 0.06528949 0.005787179 1.564496e-05 Fold01
## 5 VF VF 0.9418764 0.05430830 0.003808013 7.294581e-06 Fold01
## 6 VF VF 0.9510978 0.04618223 0.002716177 3.841455e-06 Fold01
hpc_cv %>%
  accuracy(obs, pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.709
- Matthews correlation coefficient:
mcc(hpc_cv, obs, pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mcc multiclass 0.515
And then the sensitivity calculation for different estimators:
macro_sens <- sensitivity(hpc_cv, obs, pred, estimator = "macro")
weigh_sens <- sensitivity(hpc_cv, obs, pred, estimator = "macro_weighted")
micro_sens <- sensitivity(hpc_cv, obs, pred, estimator = "micro")

sens <- rbind(macro_sens, weigh_sens, micro_sens)
sens
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity macro 0.560
## 2 sensitivity macro_weighted 0.709
## 3 sensitivity micro 0.709
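The estimators differ in how the per-class sensitivities are combined: macro takes an unweighted average across classes, macro_weighted weights each class by its frequency, and micro pools all observations. A minimal by-hand sketch of the first two:
# per-class sensitivity, then unweighted and frequency-weighted averages
hpc_cv %>%
  group_by(obs) %>%
  summarise(class_sens = mean(pred == obs), n = n()) %>%
  summarise(macro = mean(class_sens),
            macro_weighted = weighted.mean(class_sens, n))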
And the ROC AUC (for multiclass data the default estimator is the Hand-Till method):
roc_auc(hpc_cv, obs, VF, F, M, L)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 0.829
The ROC AUC with the macro-weighted estimator:
roc_auc(hpc_cv, obs, VF, F, M, L, estimator = "macro_weighted")
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc macro_weighted 0.868
The ROC visualization on the resamples (see https://www.tidymodels.org/start/resampling/):
hpc_cv %>%
  group_by(Resample) %>%
  roc_curve(obs, VF, F, M, L) %>%
  autoplot()
9.3.2 Conclusion
To judge the effectiveness of a variety of models and choose between them, we need to consider how well these models behave, using performance statistics such as:
- the area under the Receiver Operating Characteristic (ROC) curve, and
- overall classification accuracy.
In conclusion, when judging a model's effectiveness it is important to follow a few clear steps:
- check whether the data used for modeling contain any hidden information, such as modified units
- calculate the ROC curve and the area underneath it, and plot it to see how it behaves for the model
- apply selected metrics such as RMSE, RSQ, or MAE to evaluate the estimated values
- build the confusion matrix as well as the related metrics (sensitivity, specificity, accuracy, F1, ...)
- finally, apply the ROC curve visualization again on the resamples to see the best fit
9.3.2.1 Resources
- yardstick: https://yardstick.tidymodels.org/
- recipes: https://www.tidymodels.org/start/recipes/
- ROC curve: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/?utm_source=linkedin&utm_medium=social&utm_campaign=old-blog&utm_content=B&custom=LDV150
- Decoding the confusion matrix: https://towardsdatascience.com/decoding-the-confusion-matrix-bb4801decbb