5.6 Multi-level data
It’s important to figure out what the independent experimental unit is in your data. In the Ames dataset, there is one row per house and so houses and their properties are considered to be independent of one another.
In other datasets, there may be multiple rows per experimental unit (e.g. patients who are measured repeatedly over time). This has implications for data splitting: if rows from the same unit land in both sets, the test set is no longer independent of the training set and performance estimates tend to be too optimistic. To avoid this, split along the independent experimental units, so that a chosen proportion (X%) of the units, with all of their rows, goes into the training set.
Example:
# data source: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#oxboys
child_heights <- read_delim(here::here("data/Oxboys.txt"), col_names = FALSE, delim = " ") %>%
  purrr::set_names(c("child_id", "age_norm", "height", "measurement_id", "season"))
head(child_heights)
## # A tibble: 6 × 5
##   child_id age_norm height measurement_id season
##      <dbl>    <dbl>  <dbl>          <dbl>  <dbl>
## 1        1   -1       140.              1   7.32
## 2        1   -0.747   143.              2   9.36
## 3        1   -0.463   145.              3   0.84
## 4        1   -0.164   147.              4   4.32
## 5        1   -0.002   148.              5   6.24
## 6        1    0.247   150.              6   9.36
Depending on the modeling problem, we may want to split the data into training and test sets so that all data points for a given child stay in the same set. You can do this with the following code.
child_heights_split <- child_heights %>%
  group_nest(child_id) %>%
  initial_split()
child_heights_train <- training(child_heights_split) %>%
  unnest(data)
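One way to convince yourself that a split like this keeps each child in a single set is to build the test set the same way and check that no `child_id` is shared. The sketch below is a minimal illustration, not part of the original analysis: `toy_heights` is a made-up stand-in for `child_heights` (invented values, only three columns) so that the code runs without the data file, and the seed is arbitrary.

```r
library(dplyr)
library(tidyr)
library(rsample)

# Hypothetical toy data standing in for child_heights:
# 10 children, 3 measurements each (values are invented).
toy_heights <- tibble(
  child_id = rep(1:10, each = 3),
  measurement_id = rep(1:3, times = 10),
  height = 140 + child_id + 2 * measurement_id
)

set.seed(123)  # arbitrary seed, only to make the random split reproducible

toy_split <- toy_heights %>%
  group_nest(child_id) %>%  # one row per child, measurements in list-column `data`
  initial_split()           # rows (children) are assigned to training or test

toy_train <- training(toy_split) %>% unnest(data)
toy_test  <- testing(toy_split) %>% unnest(data)

# Children present in both sets; this should be empty.
shared_ids <- intersect(toy_train$child_id, toy_test$child_id)
length(shared_ids)
```

Because the split is made on the nested data frame, each child is a single row at the moment of splitting, so all of a child's measurements necessarily move together when the data are unnested again.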
For other tasks, it might be more suitable to split along the measurement id, so that every child's last measurement forms the test set.
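A split along the measurement id could be sketched as follows. This is one possible approach rather than a prescribed method: it assumes that a larger `measurement_id` means a later measurement, and it again uses a made-up `toy_heights` table (invented values) in place of `child_heights` so that the example is self-contained.

```r
library(dplyr)

# Hypothetical toy data standing in for child_heights:
# 10 children, 3 measurements each (values are invented).
toy_heights <- tibble(
  child_id = rep(1:10, each = 3),
  measurement_id = rep(1:3, times = 10),
  height = 140 + child_id + 2 * measurement_id
)

# Test set: the last measurement of every child
# (assumes a higher measurement_id means a later measurement).
toy_last_test <- toy_heights %>%
  group_by(child_id) %>%
  slice_max(measurement_id, n = 1, with_ties = FALSE) %>%
  ungroup()

# Training set: all earlier measurements of every child.
toy_earlier_train <- toy_heights %>%
  anti_join(toy_last_test, by = c("child_id", "measurement_id"))
```

With this kind of split every child appears in both sets, which is intended here: the question being tested is how well the model predicts a later measurement for children it has already seen.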