5.9 Meeting Videos

5.9.1 Cohort 1

Meeting chat log
00:09:24    Jon Harmon (jonthegeek):    Oops typo! Tan shouldn't have accepted that!
00:10:05    Tan Ho: umm
00:10:18    Tyler Grant Smith:  I never take notes, but if I did, I wish I would take them like this
00:10:24    Jon Harmon (jonthegeek):    Certainly can't be MY fault for TYPING it.
00:10:32    Tan Ho: what typo are we talking about?
00:10:36    Tony ElHabr:    you don't need docs if you got diagrams like this
00:10:43    Jon Harmon (jonthegeek):    "too list training data"
00:10:55    Tan Ho: also...is "data spending" a hadleyism?
00:10:57    Tyler Grant Smith:  I also object to this order
00:11:11    Jonathan Trattner:  @Jon My professor always gets mad at R for doing what she tells it to instead of what she wants it to do
00:11:12    Jon Harmon (jonthegeek):    I think it's a Maxim. And I like it.
00:11:17    Tyler Grant Smith:  preprocessing needs to be done after the split
00:11:31    Tyler Grant Smith:  some of it does anyway...
00:11:40    Jonathan Trattner:  Does PCA Tyler?
00:11:58    Tyler Grant Smith:  yes, I would say so
00:12:35    Jonathan Trattner:  👍🏼
00:12:47    Jon Harmon (jonthegeek):    We'll talk about processing in the next chapter :)
00:12:56    Tony ElHabr:    pre-processing is done after the splitting in the normal tidy workflow. I guess the diagram was just "wrong"?
00:13:38    Jon Harmon (jonthegeek):    It can make sense to do the processing before splitting if you don't have a nice system like recipes to make sure they're processed the same.
00:14:07    Tyler Grant Smith:  it can make sense to be wrong too :)
00:14:15    Jonathan Trattner:  Also if you can reduce the dimensionality of it before hand, would it not make sense to do that first and split the simpler data?
00:14:29    Jon Harmon (jonthegeek):    The idea is you should treat your test data the same as you'd treat new data.
00:14:54    Jon Harmon (jonthegeek):    If you do it before the split, you might do something that's hard to do or might include it in an ~average, etc, and thus leak into the training data.
00:15:12    Jonathan Trattner:  That makes sense, thanks!
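
A minimal sketch (not from the meeting) of the "split first, preprocess second" point, assuming the `ames` data that ships with tidymodels: the recipe estimates its centering/scaling and PCA parameters from the training set only, so nothing about the test set leaks into them.

```r
library(tidymodels)

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

leak_free_rec <- recipe(Sale_Price ~ ., data = ames_train) |>
  step_normalize(all_numeric_predictors()) |>            # means/SDs from training data
  step_pca(all_numeric_predictors(), num_comp = 5)       # loadings from training data

prepped <- prep(leak_free_rec, training = ames_train)    # estimate on training only
bake(prepped, new_data = ames_test)                      # apply those estimates to test
```
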
00:15:16    Jarad Jones:    Class imbalance, perfect! I was hoping to go over how to decide between upsampling or downsampling
00:15:39    Jon Harmon (jonthegeek):    We won't do much there yet, he goes into it more in 10 I think.
00:15:59    Jon Harmon (jonthegeek):    But feel free to ask Asmae about it!
00:16:12    Jarad Jones:    Haha, shoot, will have to wait a bit then
00:17:02    Tyler Grant Smith:  question for later:  for what types of models is upsampling/downsampling suggested/necessary?  I find in xgboost, for example, that I rarely need to do it, or at least that it doesn't make the model results any better
00:18:09    Maya Gans:  +1 this question ^^^
00:18:13    Conor Tompkins: Tabyl is such a useful function
00:18:29    Tyler Grant Smith:  janitor as a whole is fantastic
00:18:45    Jordan Krogmann:    janitor::clean_names() mvp
00:18:56    Jonathan Trattner:  Huge facts ^^
00:18:58    Tyler Grant Smith:  ^
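
For anyone who hasn't used {janitor}, a quick sketch of the two helpers mentioned above; the messy `raw_df` column names are made up for illustration.

```r
library(janitor)

raw_df <- data.frame(
  `Sale Price` = c(105000, 126000, 115000),
  `Bldg Type`  = c("OneFam", "TwoFam", "OneFam"),
  check.names  = FALSE
)

clean_df <- clean_names(raw_df)   # -> sale_price, bldg_type
names(clean_df)

tabyl(clean_df, bldg_type)        # counts + proportions, handy for spotting class imbalance
```
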
00:19:03    Jon Harmon (jonthegeek):    Correction: He briefly mentions upsampling in the next chapter.
00:19:09    arjun paudel:   is it prob or prop? I thought the argument for initial_split was prop
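
It is `prop`: the share of the data that goes to training in `rsample::initial_split()`. A tiny check (mtcars is just a placeholder data set):

```r
library(rsample)

set.seed(502)
car_split <- initial_split(mtcars, prop = 0.80)  # the argument is `prop`, not `prob`
nrow(training(car_split))
nrow(testing(car_split))
```
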
00:19:25    Scott Nestler:  Yes!  We recently did a "Blue Collar Data Wrangling" class with coverage of janitor and plumber.
00:19:29    Tony ElHabr:    the upsampling/downsampling question is a good one. I think frameworks that use boosting/bagging may not need it, but it's always worth testing
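
One way to act on "always worth testing" is to make the resampling part of the recipe with the {themis} package, so it only ever touches the training data; `train_df` and its `class` column below are made up.

```r
library(recipes)
library(themis)

set.seed(42)
train_df <- data.frame(
  x     = rnorm(1000),
  class = factor(sample(c("yes", "no"), 1000, replace = TRUE, prob = c(0.1, 0.9)))
)

up_rec <- recipe(class ~ ., data = train_df) |>
  step_upsample(class, over_ratio = 1)      # duplicate minority rows until balanced

down_rec <- recipe(class ~ ., data = train_df) |>
  step_downsample(class, under_ratio = 1)   # drop majority rows until balanced

# Whether either helps is model-dependent (boosted trees often cope without it),
# so compare the resampled model's metrics against a no-sampling baseline.
table(bake(prep(up_rec),   new_data = NULL)$class)
table(bake(prep(down_rec), new_data = NULL)$class)
```
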
00:20:07    Tony ElHabr:    the downside is not using stratification
00:20:36    Tan Ho: always log, always stratify
00:20:37    Tan Ho: got it
00:22:14    Tan Ho: *looks around nervously*
00:22:51    Jordan Krogmann:    I mean you're not going to not log
00:23:12    Jordan Krogmann:    *waiting for the number of counter articles*
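
"Always log, always stratify" is roughly what the chapter does with the ames data: log the skewed outcome, then stratify the split on it so both sets span the full price range. A sketch:

```r
library(tidymodels)

ames_logged <- ames |> mutate(Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames_logged, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
```
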
00:24:00    Jon Harmon (jonthegeek):    Woot, I have a PR accepted in this book now (for a minor typo at the end of this chapter) :)
00:24:01    Tyler Grant Smith:  I gotta imagine that stratified sampling and random sampling converge as n->inf
00:24:23    Tony ElHabr:    law of large numbers
00:24:25    Tyler Grant Smith:  and it happens probably pretty quickly
00:24:43    Jon Harmon (jonthegeek):    Yeah, I guess a downside would be if you stratify so much that it doesn't make sense and causes rsample to complain.
00:25:12    Jon Harmon (jonthegeek):    There's a minor change starting next chapter, not yet merged: https://github.com/tidymodels/TMwR/pull/106/files
00:27:55    Tyler Grant Smith:  i frequently work with data like this
00:28:18    Conor Tompkins: It would be interesting to have a table of model types and how they react to things like missingness, class imbalance, one-hot encoding etc. so you can choose the appropriate model for the specific weirdness of your data.
00:28:36    Tony ElHabr:    so at what point do you use a longitudinal model over something else
00:29:31    Jordan Krogmann:    student re-enrollment cycle... how does the last term impact future terms
00:31:14    Tony ElHabr:    memes in the wild
00:31:17    Tony ElHabr:    i'm here for it
00:31:20    Jon Harmon (jonthegeek):    Yup! And there's a whole thing about the fact that each question a student answers technically influences the next one, even if they don't get feedback.
00:32:57    Scott Nestler:  I recall learning (many years ago) about using 3 sets -- Training, Test, and Validation.  Training to train/build models, Validation to assess the performance of different (types of) models on data not used to train them, and then Test to fine-tune model parameters once you have picked one.  The splits were usually something like 70/15/15 or 80/10/10.  This didn't seem to be discussed in this chapter.  Any idea why?
00:33:37    Jon Harmon (jonthegeek):    We'll talk about validation later, I think. There's a minute of it. Gonna talk about this out loud in a sec...
00:34:43    Tyler Grant Smith:  5.3 What about a validation set?
00:35:49    Tony ElHabr:    If you do cross-validation, the CV eval metric is effectively your validation
00:35:50    Jonathan Trattner:  What about cross-validation on the training set? Is that different than what we’re discussing now?
00:35:53    Tony ElHabr:    and your training
00:36:10    Tyler Grant Smith:  ya... split into train-validate and test first, and then split train-validate into train and validate
00:36:42    Jarad Jones:    I think cross-validation is used during model training on the training set
00:37:08    Ben Gramza: I actually watched a "deep-learning" lecture on this today. The guy said that a validation set is used to select your parameters/hyperparameters, then you test your tuned model on the test set. 
00:40:11    Tony ElHabr:    validation makes more sense when you're comparing multiple model frameworks too. the best one on the validation set is what is ultimately used for the test set
00:41:45    Jordan Krogmann:    i think it comes into play when you are hyperparameter tuning for a single model
00:44:21    Ben Gramza: yeah, for example if you are using a K-nearest neighbor model, you use the validation set on your models with K=1, 2, 3, … . You select the best performing K from the validation set, then test that on the test set. 
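
A sketch of the two-stage split Tyler describes at 00:36:10, again using ames as a stand-in; the 60/20/20 proportions are arbitrary, and newer rsample releases also offer a one-step three-way split helper.

```r
library(tidymodels)

set.seed(123)
first_split <- initial_split(ames, prop = 0.80)              # carve off the 20% test set
ames_test   <- testing(first_split)

second_split <- initial_split(training(first_split), prop = 0.75)
ames_train   <- training(second_split)                       # 60% of the original data
ames_val     <- testing(second_split)                        # 20% validation set

# Cross-validation on the training set plays the same role: each held-out fold
# acts as a small validation set, e.g. when choosing K for a KNN model.
folds <- vfold_cv(ames_train, v = 10)
```
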
00:46:22    Joe Sydlowski:  Good question!
00:46:28    Jordan Krogmann:    i do it on all of it
00:46:45    Jordan Krogmann:    annnnnnnnnnd i am probably in the wrong lol
00:50:20    Jordan Krogmann:    yuup otherwise you will cause leakage
00:57:41    Tyler Grant Smith:  i suppose I need to add inviolate to my day-to-day vernacular
00:58:52    Jon Harmon (jonthegeek):    I'm noticing myself say that over and over and I don't know why!
00:59:50    Tony ElHabr:    i had to google that
01:05:17    Conor Tompkins: Great job asmae!
01:05:22    Jonathan Trattner:  ^^^
01:05:28    Tony ElHabr:    Pavitra getting ready for recipes
01:05:37    Jordan Krogmann:    great job!
01:05:42    Joe Sydlowski:  Thanks Asmae!
01:05:46    Andy Farina:    That was great Asmae, thank you!
01:05:47    Pavitra Chakravarty:    🤣🤣🤣🤣
01:05:56    caroline:   Thank you Asmae :)
01:05:59    Pavitra Chakravarty:    great presentation Asmae

5.9.2 Cohort 2

Meeting chat log
00:07:17    Janita Botha:   Sorry I'm late... Been slow booting up...
00:08:46    Amélie Gourdon-Kanhukamwe (she/they):   https://supervised-ml-course.netlify.app/
00:08:58    shamsuddeen:    Thank you
00:09:07    Stephen Holsenbeck: thanks!
00:22:06    Janita Botha:   Just a side warren... I find the focus on testing vs training data in tidymodels very frustrating since the field that I am in focusses more on inferential statistics because we tend to have relatively small sample sizes for the large amount of variance we encounter
00:22:52    Janita Botha:   In other words, in my field my data is usually better "spent" as training data...
00:38:09    Louis Carlo Medina: Thanks Rahul! yeah, I think I conflated oversampling with strata. I think I remember the strata now, where you actually sample within groups as opposed to the group as a whole.
00:38:29    shamsuddeen:    Yes, !
00:41:00    rahul bahadur:  No worries.
00:42:26    Mikhael Manurung:   It should be random regardless of whether strata is specified or not
00:43:00    rahul bahadur:  For stratified random sampling, the strata are created first and then random samples are taken from it
00:43:31    shamsuddeen:    Why don’t we stratify all the time?
00:44:14    rahul bahadur:  it is not needed when the data is balanced. However, you can
00:44:16    shamsuddeen:    The book says: “There is very little downside to using stratified sampling. “
00:44:40    shamsuddeen:    Ah, I see.  Stratified is for imbalanced data
00:44:52    shamsuddeen:    Thanks raul.
00:45:01    Stephen Holsenbeck: stratification should basically be the default
00:45:06    shamsuddeen:    *rahul
00:45:36    Stephen Holsenbeck: If you go completely random, your classes in the test set may not match the category distributions in the dataset
00:45:45    August: https://otexts.com/fpp2/accuracy.html
00:45:48    Stephen Holsenbeck: same with the training set
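
A small illustration of Stephen's point, with a made-up `df` containing a 15% minority class: a completely random split can give a test set whose class mix drifts from the full data, while `strata` keeps the proportions close.

```r
library(rsample)
library(dplyr)

set.seed(1)
df <- tibble(
  x     = rnorm(200),
  class = factor(sample(c("rare", "common"), 200, replace = TRUE, prob = c(0.15, 0.85)))
)

random_split <- initial_split(df, prop = 0.80)                  # completely random
strat_split  <- initial_split(df, prop = 0.80, strata = class)  # stratified

count(testing(random_split), class) |> mutate(prop = n / sum(n))
count(testing(strat_split),  class) |> mutate(prop = n / sum(n))
```
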
00:46:02    August: this is a good diagram for time series cross validation
00:46:34    August: Training test at top of section
00:46:46    Janita Botha:   @Amelie that is a really good question - you should add that to the questions for Julia and Max
00:47:11    Louis Carlo Medina: ^+1. Hyndman et al's texts for timeseries stuff are really good
00:47:43    rahul bahadur:  Yes, Hyndman has good timeseries
00:49:43    shamsuddeen:    From the book.
00:49:44    shamsuddeen:    Too much data in the training set lowers the quality of the performance estimates.
00:49:52    shamsuddeen:    too much data in the test set handicaps the model’s ability to find appropriate parameter estimates.
00:50:14    shamsuddeen:    What is the difference between performance estimates and parameter estimates?
00:51:26    Janita Botha:   https://otexts.com/fpp3/
00:51:35    rahul bahadur:  Parameter estimates, in case of regression for example, would be the beta estimates
00:51:38    Amélie Gourdon-Kanhukamwe (she/they):   Performance estimates = broadly, statistics assessing the quality of the model, such as MSE, percentage correct, area under the curve
00:51:45    rahul bahadur:  Performance estimates = MSE
00:51:47    Amélie Gourdon-Kanhukamwe (she/they):   And yes, as Rahul
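
A tiny illustration of the distinction, using mtcars and a plain `lm()`: the coefficients fitted on the training set are parameter estimates, and the RMSE of its predictions on the test set is a performance estimate.

```r
library(rsample)
library(yardstick)

set.seed(10)
mt_split <- initial_split(mtcars, prop = 0.75)

fit <- lm(mpg ~ wt + hp, data = training(mt_split))
coef(fit)                                           # parameter estimates (the betas)

results <- data.frame(
  truth    = testing(mt_split)$mpg,
  estimate = predict(fit, newdata = testing(mt_split))
)
rmse(results, truth = truth, estimate = estimate)   # a performance estimate
```
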
00:52:08    Janita Botha:   I shared the link above
00:52:13    Kevin Kent: Modeltime - https://cran.r-project.org/web/packages/modeltime/vignettes/getting-started-with-modeltime.html
00:52:32    Amélie Gourdon-Kanhukamwe (she/they):   Thanks Janitha
00:55:56    Kevin Kent: https://physionet.org/
00:59:34    Amélie Gourdon-Kanhukamwe (she/they):   https://cran.r-project.org/web/packages/anomalize/vignettes/anomalize_quick_start_guide.html
00:59:38    August: https://cran.r-project.org/web/packages/anomalize/anomalize.pdf
00:59:58    August: https://cran.r-project.org/web/packages/anomalize/vignettes/anomalize_methods.html

5.9.3 Cohort 3

Meeting chat log
00:23:04    Daniel Chen (he/him):   I don't know why it's 10%
00:23:23    Daniel Chen (he/him):   it seems like a good heuristic?
00:29:33    Daniel Chen (he/him):   (Strata below 10% of the total are pooled together.)
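
The 10% Daniel quotes appears to be the default of the `pool` argument that rsample's splitting functions pass to their stratification helper: strata holding less than that fraction of the rows are merged before splitting. A made-up example (treat the exact behaviour as an assumption worth checking against the docs):

```r
library(rsample)

set.seed(7)
df <- data.frame(
  y     = rnorm(300),
  group = factor(sample(c("a", "b", "c"), 300, replace = TRUE, prob = c(0.48, 0.48, 0.04)))
)

# "c" is ~4% of the rows, below the default pool = 0.1, so it gets pooled with
# another stratum before splitting; a smaller `pool` keeps it as its own stratum.
initial_split(df, prop = 0.80, strata = group)               # default pool = 0.1
initial_split(df, prop = 0.80, strata = group, pool = 0.03)
```
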
00:54:00    Ildiko Czeller: https://spatialsample.tidymodels.org/reference/spatialsample.html

5.9.4 Cohort 4

Meeting chat log
00:17:06    Isabella Velásquez: Need an R2Cobalt package...
00:17:16    Laura Rose: yes!
00:28:03    Isabella Velásquez: https://juliasilge.com/blog/
00:29:36    Isabella Velásquez: Supervised Machine Learning for Text Analysis in R: https://smltar.com/
00:29:45    Isabella Velásquez: ISLR tidymodels Labs: https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/index.html
00:31:46    Isabella Velásquez: A couple more posts! https://www.danielphadley.com/bootstrap_tutto/
00:31:51    Isabella Velásquez: https://www.rebeccabarter.com/blog/2020-03-25_machine_learning/
00:32:10    Stephen Charlesworth:   Here’s a spreadsheet w/ different tidymodels uses: https://docs.google.com/spreadsheets/d/1PzbtanwieCCpaUt9G6XMhKXQ_0cBWfq_9xG9bH33Occ/edit#gid=652338894
00:33:34    Federica Gazzelloni:    https://www.tidymodels.org/
00:52:38    Stephen Charlesworth:   https://twitter.com/towards_ai/status/1332567246011555840
00:53:06    Laura Rose: 😆
00:55:13    Isabella Velásquez: https://en.wikipedia.org/wiki/Pareto_principle
01:05:39    Laura Rose: cross validation using modeltime https://cran.r-project.org/web/packages/modeltime.resample/vignettes/getting-started.html