5.6 Multi-level data

It’s important to figure out what the independent experimental unit is in your data. In the Ames dataset, there is one row per house, so houses and their properties are considered independent of one another.

In other datasets, there may be multiple rows per experimental unit (e.g. patients who are measured repeatedly over time). This has implications for data splitting. To avoid data from the same experimental unit ending up in both the training and the test set, split along the independent experimental units, so that a chosen proportion of units (rather than of rows) goes into the training set and all rows of a unit stay together.

Example:

# data source: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#oxboys
library(tidyverse)  # read_delim(), %>% and purrr::set_names()

# the file has no header row, so the column names are set by hand
child_heights <- read_delim(here::here("data/Oxboys.txt"), col_names = FALSE, delim = " ") %>%
  purrr::set_names(c("child_id", "age_norm", "height", "measurement_id", "season"))
head(child_heights)

## # A tibble: 6 × 5
##   child_id age_norm height measurement_id season
##      <dbl>    <dbl>  <dbl>          <dbl>  <dbl>
## 1        1   -1       140.              1   7.32
## 2        1   -0.747   143.              2   9.36
## 3        1   -0.463   145.              3   0.84
## 4        1   -0.164   147.              4   4.32
## 5        1   -0.002   148.              5   6.24
## 6        1    0.247   150.              6   9.36

Depending on the modeling problem, we may want to split the data into training and test sets so that all data points for a given child remain together. You can do this with the following code.

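# nest to one row per child so that whole children, not single rows, are sampled
child_heights_split <- child_heights %>% 
  group_nest(child_id) %>%
  initial_split()

# unnest to get back one row per measurement
child_heights_train <- training(child_heights_split) %>% 
  unnest(data)

child_heights_test <- testing(child_heights_split) %>% 
  unnest(data)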
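
After a grouped split like this, it is worth checking that no child ended up in both sets. The sketch below does this on a small made-up table; `toy_heights` and its values are invented for illustration and only stand in for `child_heights`, and the 80% proportion and the seed are arbitrary choices.

```r
library(dplyr)
library(tidyr)
library(rsample)

# Hypothetical stand-in for child_heights: 10 children, 3 measurements each
toy_heights <- tibble(
  child_id = rep(1:10, each = 3),
  measurement_id = rep(1:3, times = 10),
  height = 140 + rep(1:10, each = 3) + rep(c(0, 2, 4), times = 10)
)

set.seed(123)  # make the random split reproducible

# Nest to one row per child, then split the children (not the rows)
toy_split <- toy_heights %>%
  group_nest(child_id) %>%
  initial_split(prop = 0.8)

toy_train <- training(toy_split) %>% unnest(data)
toy_test <- testing(toy_split) %>% unnest(data)

# Ids that occur in both sets; this should be empty
shared_ids <- intersect(toy_train$child_id, toy_test$child_id)
length(shared_ids)
```

If `shared_ids` is empty, every child's measurements sit entirely in one of the two sets. Recent versions of rsample also appear to offer `group_initial_split()`, which takes the grouping column directly; check the documentation of your installed version before relying on it.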

For other tasks, it might be more suitable to split along the measurement id, so that every child’s last measurement forms the test set.
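
A split along the measurement id could look roughly like the sketch below. It again uses a small made-up table (`toy_heights` is hypothetical, not the Oxboys data) and assumes that the highest `measurement_id` within a child is that child's last measurement.

```r
library(dplyr)

# Hypothetical stand-in for child_heights: 4 children, 3 measurements each
toy_heights <- tibble(
  child_id = rep(1:4, each = 3),
  measurement_id = rep(1:3, times = 4),
  height = c(140, 143, 145, 136, 138, 141, 150, 152, 155, 128, 131, 133)
)

# Test set: the row with the highest measurement_id for every child
toy_test <- toy_heights %>%
  group_by(child_id) %>%
  slice_max(measurement_id, n = 1, with_ties = FALSE) %>%
  ungroup()

# Training set: all remaining rows, i.e. every earlier measurement
toy_train <- anti_join(toy_heights, toy_test,
                       by = c("child_id", "measurement_id"))
```

Here every child appears in both sets by design: the model sees a child's earlier measurements and is evaluated on the most recent one. This suits problems where the goal is to predict a future value for an already known unit.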