5.8 Summary
Data splitting is an important part of a modeling workflow as it impacts model validity and performance. The most common splitting technique is random splitting. Some data, such as time-series or multi-level data require a different data splitting technique called stratified sampling. The rsample
package contains many functions that can perform random splitting and stratified splitting.
We will learn more about how to remedy certain issues such as class imbalance, bias and overfitting in Chapter 10.
5.8.1 References
- Tidy modeling with R by Max Kuhn and Julia Silge: https://www.tmwr.org/splitting.html
- Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson: https://bookdown.org/max/FES/
- Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels: https://juliasilge.com/blog/himalayan-climbing/
- Data preparation and feature engineering for machine learning: https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
- How to Build a Machine Learning Model by Chanin Nantasenamat: https://towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1