8.9 Meeting Videos

8.9.1 Cohort 1

Meeting chat log
00:06:28    Tyler Grant Smith:  getting kind of scruffy jon
00:07:22    Jim Gruman: {purr}
00:07:37    Jim Gruman: {purrr}
00:18:31    Tony ElHabr:    is this thing on?
00:18:39    Jonathan Trattner:  The chat?
00:18:43    Jon Harmon (jonthegeek):    I know, it's so quiet over here!
00:19:05    Tony ElHabr:    quiet chat is making me nervous
00:19:27    Jonathan Trattner:  I’ll make some noise
00:19:38    Jonathan Trattner:  🔈
00:20:06    Tyler Grant Smith:  it would be good (in the bookdown) to have a comparison of stratified vs non-stratified sampling for this example.  with a comparison of the distributions
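A rough sketch of the comparison Tyler suggests, assuming the `ames` data from the modeldata package as used in the chapter:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)
plain_split <- initial_split(ames, prop = 0.80)                       # simple random split
strat_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)  # stratified on the outcome

# compare the outcome distribution in the two training sets
summary(training(plain_split)$Sale_Price)
summary(training(strat_split)$Sale_Price)
```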
00:20:07    Tony ElHabr:    ugh I read chapter 7
00:21:38    Asmae Toumi:    Wait what does all_nominal do, missed it
00:21:51    Jon Harmon (jonthegeek):    Selects all columns that... what she's saying :D
00:21:54    Asmae Toumi:    Oh ok nvmd
00:22:30    yonis:  We basically need to create a design matrix for the regression
00:22:40    Joe Sydlowski:  For clarity to Jon's answer it won't include numeric vars, right?
00:22:52    Tony ElHabr:    right
00:23:06    Jon Harmon (jonthegeek):    all_numeric() is its counterpart
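A minimal sketch of the selector pair being discussed, assuming `ames_train` as elsewhere in the chapter:

```r
library(recipes)

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal()) %>%                   # factor/character columns only
  step_normalize(all_numeric(), -all_outcomes())  # numeric columns, minus the outcome
```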
00:23:59    Conor Tompkins: step_dummy() drops the reference level, I think
00:24:46    yonis:  That is tricky. The reference level isn't lined up with how base R defines it, so you need to be careful with that
00:25:47    yonis:  I ran a logistic regression and got into all kinds of trouble with how recipes was defining the ref level
00:27:11    arjun paudel:   does it not set the reference level based on order of factor levels? that was my understanding
00:27:29    arjun paudel:   if you want a specific level as your reference, you reorder your factor
00:27:49    Conor Tompkins: That is my understanding as well Arjun
00:29:17    Conor Tompkins: I use step_relevel to set the reference level
00:30:15    yonis:  https://recipes.tidymodels.org/reference/step_relevel.html
00:30:20    Tyler Grant Smith:  why are the counts almost monotonic, but not monotonic?
00:31:42    Conor Tompkins: This is a great table to show this
00:35:44    Scott Nestler:  I don't follow the mention of one-hot encoding in the book. Why would you use that instead of binary encoding like what was just shown here?
00:36:37    Asmae Toumi:    I say Tilda, is that right?
00:38:01    Tony ElHabr:    one-hot: like removing the intercept term in your regression with a univariate categorical variable. so you get coefficients for each term
00:38:16    Conor Tompkins: The winner of the big data bowl determines the pronunciation, I think
00:38:27    Asmae Toumi:    Lmaooooooo
00:38:32    Tan Ho: asmae v tony, fightttt
00:38:53    Joe Sydlowski:  The benefit of one hot encoding is that you don't need to know (or explain) what the reference variable is when interpreting the coefficients. Not every model can use one hot encoding though
00:39:04    Scott Nestler:  @ Tony: But wouldn't that create the linear dependency problem as is discussed in the text?
00:39:35    Scott Nestler:  Thanks, Joe.  Got it.  That makes sense.
00:40:26    Tony ElHabr:    i think you're right about that Scott. or maybe my analogy was just bad
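For reference, `step_dummy()` has a `one_hot` argument; a sketch of the trade-off discussed above:

```r
# the default (one_hot = FALSE) drops a reference level; one_hot = TRUE keeps
# all levels, which suits trees but reintroduces the linear dependence Scott notes
recipe(Sale_Price ~ Bldg_Type, data = ames_train) %>%
  step_dummy(Bldg_Type, one_hot = TRUE)
```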
00:41:06    Conor Tompkins: Are there best practices for determining the appropriate reference level? I typically use the most common level
00:42:39    Scott Nestler:  ICA is my favorite type of feature extraction to use.  Makes use of higher-order moments than PCA, resulting in components that are truly statistically independent, not just uncorrelated.
00:43:04    arjun paudel:   @Conor, what level you want as reference is entirely based on context of the problem, I don't think one standard way of determining the reference level would make sense
00:43:37    Asmae Toumi:    conor I pick mine in a way that makes interpretation easier for ppl who digest the findings
00:44:05    Scott Nestler:  Agree with Arjun and Asmae; it depends on the variable and ease of interpretation.
00:44:16    Tony ElHabr:    any kind of thought put into reference level is probably better than alphabetical imo
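One way to implement Conor's most-common-level heuristic, assuming forcats: `fct_infreq()` puts the most frequent level first, which base R then treats as the reference:

```r
library(forcats)
library(dplyr)

ames_train %>%
  mutate(Neighborhood = fct_infreq(Neighborhood))  # most common level becomes the reference
```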
00:44:55    Tony ElHabr:    Scott, do you have a good reference on ICA?
00:45:57    Jordan Krogmann:    Has anyone had to create the "bake" function in SQL?  Let me tell you it's less than fun for new records hitting your model...
00:46:12    Scott Nestler:  Book by Hyvarinen, Karhunen, Oja is one standard.  Book by Stone is more approachable.  I have a presentation on it that I developed too.  Happy to share.
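recipes does ship a `step_ica()` (it needs the fastICA package installed); a hedged sketch, normalizing first as both ICA and PCA expect:

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_ica(all_numeric(), -all_outcomes(), num_comp = 3)  # 3 independent components
```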
00:46:25    Daniel Chen:    how is the recipe object implemented? is it a dataframe with an attribute table that defines whether or not a variable is a predictor or response?
00:47:22    Jon Harmon (jonthegeek):    There's a tibble in there, but it has a lot going on. I... can't remember details, it's been a bit since I dug into it.
00:48:47    Daniel Chen:    s3 objects are just lists with an attr defined, right?
00:50:16    Jon Harmon (jonthegeek):    They're not necessarily lists.
00:50:33    Tony ElHabr:    attributes are the magic to S3
00:50:48    Tony ElHabr:    but right, not necessarily lists
00:50:49    Tan Ho: deep, dark magic
00:50:52    Jon Harmon (jonthegeek):    If I remember right, recipes are lots o' attributes.
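A tiny illustration of the S3 points above; `var_info` shows up again later in Cohort 2's chat:

```r
# any R object can carry attributes; S3 dispatch only needs `class`
x <- 1:3                # an atomic vector, not a list
class(x) <- "greeting"  # S3 methods for "greeting" now apply to x
attributes(x)

# a recipe is a classed list holding tibbles of variable info plus the steps
rec <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames_train)
class(rec)  # "recipe"
names(rec)  # includes "var_info", "term_info", "steps", "template"
```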
00:50:56    Tyler Grant Smith:  sounds like someone wants to write dbrecipes
00:51:04    Tony ElHabr:    sounds like ETL
00:51:15    Asmae Toumi:    dbrecipes omg
00:51:46    Conor Tompkins: Luv 2 engineer data
00:51:53    Tan Ho: "do all the work for you" eh
00:52:35    arjun paudel:   hahaa
00:52:56    arjun paudel:   i meant you don't have to prep or bake it yourself
00:53:13    Jordan Krogmann:    lol, I guess I was looking for a reason to lose sleep @tyler
00:53:28    Tony ElHabr:    step_drop_table in dbrecipes could be disastrous
00:54:46    Conor Tompkins: step_rm_rf
00:54:49    Tony ElHabr:    tidymodels before workflows was an experience
00:55:14    Tan Ho: it's superseded by bake NULL
00:55:49    Tony ElHabr:    every time i typed juice() i felt disgusting
00:56:00    Asmae Toumi:    lmfao
00:56:29    Conor Tompkins: Workflow feels like the %>% for tidymodels
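A sketch of the workflow pattern Conor and Tony are alluding to, reusing the `simple_ames` recipe from the chapter:

```r
library(workflows)
library(parsnip)

lm_wflow <- workflow() %>%
  add_recipe(simple_ames) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# fit() runs prep()/bake() internally -- no manual juice() required
lm_fit <- fit(lm_wflow, data = ames_train)
```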
00:58:17    Asmae Toumi:    hahahahahahah
00:58:36    Tyler Grant Smith:  no congrats for tan
00:58:39    Tony ElHabr:    thanks y'all
00:59:03    Jordan Krogmann:    Thanks pavitra!
00:59:06    Jordan Krogmann:    great job
00:59:10    Daniel Chen:    my prelims are next week xDDD
00:59:14    Asmae Toumi:    I made parsnip puree this weekend it was sooooooooooo good, so much better than mashed potatoes
00:59:24    Tony ElHabr:    tony, tan, tyler, t-rex
00:59:26    Tony ElHabr:    all the same
00:59:47    Asmae Toumi:    Jordan should go since he’s gonna be busy soon writing dbrecipes
01:00:48    Asmae Toumi:    YESSSSSSSSSSSSSSSSS
01:00:54    Asmae Toumi:    My impact!!!!!
01:01:15    Jordan Krogmann:    lol thanks a lot Asmae!
01:01:45    Tony ElHabr:    thanks so much Pavitra!
01:01:50    Tony ElHabr:    great presentation
01:01:54    Jonathan Trattner:  Great job Pavitra!
01:02:00    Conor Tompkins: Thanks Pavitra!
01:02:00    Andrew G. Farina:   Thank you Pavitra, that was a busy chapter!
01:02:09    Jonathan Trattner:  I’ll deff be asking about prep and bake again (:
01:03:11    Scott Nestler:  I'm surprised that "mise en place" didn't make an appearance in this chapter.
01:03:12    Jonathan Trattner:  I’ll play with it a little bit but thanks!
01:04:48    Daniel Chen:    it'll be less weird when we get to workflow
01:05:09    Jonathan Trattner:  Thank you!
01:06:40    Jordan Krogmann:    Thanks gonna drop guys, it's been great!

8.9.2 Cohort 2

Meeting chat log
00:18:37    Kevin Kent: I found out the other day that you can group tabs in chrome
00:18:51    shamsuddeen:    yes
00:18:57    Stephen Holsenbeck: yes! such a great new feature, I love it
00:18:59    Luke Shaw:  Yes Kevin! I'm a big fan
00:19:10    shamsuddeen:    The feature just came to Chrome today
00:19:12    Kevin Kent: https://blog.google/products/chrome/manage-tabs-with-google-chrome/
00:19:37    Amélie Gourdon-Kanhukamwe (she/they):   Ah, me too just last week, which means I feel even less bad running dozens at a time!
00:20:21    shamsuddeen:    All tabs are automatically grouped in a dropdown. Only available in Chrome Beta :)
00:20:30    shamsuddeen:    Released this feature today
00:20:53    Layla Bouzoubaa:    👀
00:30:20    Layla Bouzoubaa:    https://recipes.tidymodels.org/reference/step_YeoJohnson.html
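A minimal sketch of the step Layla links to; Yeo-Johnson is a Box-Cox-style power transform that also tolerates zero and negative values:

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes())
```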
00:32:00    Amélie Gourdon-Kanhukamwe (she/they):   Discretization is reducing numerical variables into ordinal ones, with equal width or equal frequency of bins. My understanding is equal width is preferred.
00:33:38    Amélie Gourdon-Kanhukamwe (she/they):   And that would be because some techniques don't cope well with fully continuous variables, for example Naïve Bayes (as per my understanding after learning machine learning with Weka).
00:34:16    Luke Shaw:  interesting, thanks :)
00:34:18    Stephen - Computer - No Mic:    Interesting, thank you Amelie
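For the discretization discussion, recipes has `step_discretize()` (equal-frequency bins by default) and `step_cut()` for explicit breaks; a sketch:

```r
recipe(Sale_Price ~ Gr_Liv_Area, data = ames_train) %>%
  step_discretize(Gr_Liv_Area, num_breaks = 4)  # roughly equal-frequency bins
```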
00:39:13    Luke Shaw:  I love janitor::clean_names() for fixing that kind of thing
00:39:57    Luke Shaw:  hmmmm there must be a way I agree!
00:40:45    Stephen - Computer - No Mic:    mutate(data, col = gsub("\\+", "p", col))
00:41:11    Stephen - Computer - No Mic:    or mutate(data, col = stringr::str_replace(col, "\\+", "p"))
00:42:01    Amélie Gourdon-Kanhukamwe (she/they):   Actually, I retract the part about Naïve Bayes *needing* discretization, sorry. I can't name with certainty a classifier that specifically needs discretizing, but it is believed to help performance (it didn't help much on my only project playing with ML).
00:42:04    Kevin Kent: One gotcha I've come across with making dummy variables or one-hot encoding is that the test set sometimes has levels of a dummified variable that the train set didn't have. Argument for hashing or other methods I think
00:43:32    Amélie Gourdon-Kanhukamwe (she/they):   But see eg: https://link.springer.com/content/pdf/10.1007/s10994-008-5083-5.pdf
00:52:38    Amélie Gourdon-Kanhukamwe (she/they):   What if you do this Layla? simple_ames$var_info
00:52:57    Stephen - Computer - No Mic:    Hey shamsuddeen,

I didn't get any issue using the pipe:
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
       data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal()) %>% 
  prep(training = ames_train)
00:53:00    Amélie Gourdon-Kanhukamwe (she/they):   Only if you want to see the list of var?
00:53:22    Stephen - Computer - No Mic:    The prep function takes the recipe as its first argument
00:53:38    Stephen - Computer - No Mic:    So piping it into the first argument (as is the default with a pipe) works fine
00:55:07    shamsuddeen:    Ok, thank you for looking at this. Below is an example that does not work.
00:55:12    shamsuddeen:    simple_ames <- ames_train %>% 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal())
00:55:19    shamsuddeen:    The above works
00:55:35    Stephen - Computer - No Mic:    Yes
00:55:41    shamsuddeen:    The one below does not work
00:55:42    shamsuddeen:    simple_ames <- ames_train %>% 
            prep(simple_ames)
00:56:19    Stephen - Computer - No Mic:    Yeah, simple_ames has not been created yet
00:56:27    Stephen - Computer - No Mic:    You've assigned it in that step
00:56:32    Stephen - Computer - No Mic:    So it can't be used in the step
00:57:19    Stephen - Computer - No Mic:    The variable simple_ames must exist in the environment by assignment before it can be used in code
00:57:48    shamsuddeen:    But I call them in series
00:57:53    shamsuddeen:    Like below
00:57:58    shamsuddeen:    simple_ames <- ames_train %>% 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal())
00:58:03    shamsuddeen:    simple_ames <- ames_train %>% 
            prep(simple_ames)
00:58:10    Stephen - Computer - No Mic:    Ok
00:58:20    Stephen - Computer - No Mic:    So you're passing in simple_ames as the recipe for prep
00:58:21    shamsuddeen:    The simple_ames has been created in the first statement
00:58:26    Stephen - Computer - No Mic:    So to make that work:
00:58:53    Stephen - Computer - No Mic:    simple_ames <- ames_train %>% 
            {prep(simple_ames, training = .)}
00:59:08    Stephen - Computer - No Mic:    The ames_train data has to be the second argument
00:59:22    Stephen - Computer - No Mic:    So you have to use the brackets and the . to pipe it into that second spot
00:59:29    shamsuddeen:    Yay…it works
00:59:39    shamsuddeen:    Thanks
01:00:10    shamsuddeen:    But why don't we need this format with recipe()?
01:00:33    Stephen - Computer - No Mic:    It's just the order of arguments, one moment I'll see if I can find an article on it
01:01:49    Luke Shaw:  Harking back to a previous Q by Layla on name_repair error for the values in a column, I think the function janitor::make_clean_names(col) should do it - cleans the values of the column inputted
01:02:10    shamsuddeen:    Yea, janitor can do that
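Luke's suggestion as code, with `df` and `col` as hypothetical stand-ins for Layla's data; note `make_clean_names()` also de-duplicates repeated values (`x`, `x_2`, ...), which may not be wanted for category labels:

```r
library(dplyr)

df %>%
  mutate(col = janitor::make_clean_names(col))  # cleans the values, not the column name
```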
01:02:27    Stephen - Computer - No Mic:    shamsuddeen:
https://magrittr.tidyverse.org/#the-argument-placeholder
01:02:33    shamsuddeen:    Today's session is full of questions :)
01:02:49    Stephen - Computer - No Mic:    And this: 
https://thatdatatho.com/tutorial-about-magrittrs-pipe-operator-and-placeholders/
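The placeholder pattern from those links, in one toy example:

```r
library(magrittr)

10 %>% seq(2, .)    # seq(2, 10): `.` sends the lhs into the second argument
10 %>% {seq(2, .)}  # braces also suppress first-argument insertion; same result
```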
01:03:22    Kevin Kent: This fits with my mental model but not what we saw in August's session https://stackoverflow.com/questions/62189885/what-is-the-difference-among-prep-bake-juice-in-the-r-package-recipes
01:08:14    Stephen - Computer - No Mic:    This is really helpful 👆
01:08:41    shamsuddeen:    Yes,  it makes much much sense
01:09:45    Kevin Kent: People often forget about the role of domain expertise in feature engineering. I’ve found that it helps prediction a lot, and you come up with features that you or an algorithm would never have considered.
01:10:03    Layla Bouzoubaa:    Thanks, Luke!!
01:10:12    Kevin Kent: Especially when you have a large number of possible data sets and features
01:11:04    Stephen - Computer - No Mic:    Definitely
01:13:09    Luke Shaw:  Would step_other cope better with the gotcha Kevin mentioned before? Of a value in test not seen in train
01:13:24    Janita Botha:   Yes we can keep going!
01:14:05    Kevin Kent: I think as long as the new level isn’t in the top n in the test set @luke
01:15:15    Kevin Kent: The model I was using just outputted it as a warning but I think the consequence is that it kind of ignored that new level, which isn’t ideal
01:15:39    Luke Shaw:  Ah yeah that makes sense, thanks :) In that scenario I guess it's a fairly big problem that something common in test data was never seen in train
01:16:14    Kevin Kent: yeah. and particularly challenging with categorical data…but I think embeddings might be the best way to address that for categorical data
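A sketch of guarding against unseen levels with recipes' own steps: `step_novel()` assigns previously unseen levels to a `"new"` category at bake time, and `step_other()` pools rare levels:

```r
recipe(Sale_Price ~ Neighborhood, data = ames_train) %>%
  step_novel(Neighborhood) %>%                 # unseen levels at bake time -> "new"
  step_other(Neighborhood, threshold = 0.01)   # rare training levels -> "other"
```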
01:19:25    Luke Shaw:  Thanks August! :)
01:19:38    shamsuddeen:    Thanks August
01:20:02    Kevin Kent: Thanks! Great discussion and presentation
01:22:54    Janita Botha:   Lol
01:22:59    Janita Botha:   We call it autumn
01:23:00    Layla Bouzoubaa:    Thanks everyone!

8.9.3 Cohort 3

Meeting chat log
00:30:17    Toryn Schafer (she/her):    It says on the juice help page that it is superseded by bake
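The superseded pattern and its replacement, side by side (using the chapter's `simple_ames` recipe):

```r
prepped <- prep(simple_ames, training = ames_train)
bake(prepped, new_data = NULL)  # the processed training set, formerly juice(prepped)
```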
00:30:47    Ildiko Czeller: https://recipes.tidymodels.org/articles/Ordering.html
00:51:31    Ildiko Czeller: step_pca(matches("(SF$)|(Gr_Liv)"), num_comp = 2)
00:52:01    jiwan:  https://juliasilge.com/blog/best-hip-hop/ Julia Silge has a nice blog post on PCA
01:01:52    jiwan:  ?step_normalize
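Putting Ildiko's selector together with normalization (PCA expects inputs on a common scale); a sketch assuming `ames_train`:

```r
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(matches("(SF$)|(Gr_Liv)")) %>%
  step_pca(matches("(SF$)|(Gr_Liv)"), num_comp = 2)
```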

8.9.4 Cohort 4

Meeting chat log
00:38:40    Ryan Metcalf:   After loading the tidymodels package, I ran `recipes::step` and reviewed the dropdowns. This doesn't directly answer your question… but may give some breadcrumbs for further reading. Maybe?
00:39:08    Ryan Metcalf:   @Laura, what was the other package you mentioned? GAM?
00:39:20    Laura Rose: thanks. I wonder if it's not integrated with tidymodels...the s() function that is
00:39:28    Laura Rose: yeah there's mgcv and gam
00:39:33    Laura Rose: I think
00:39:44    Laura Rose: maybe it's mcgv
00:48:19    Isabella Velásquez: Thank you!