9.2 What are the Experimental Unit and the Unit of Prediction?

Nearly 2600 wavelengths are measured each day for two weeks for each of 15 small-scale bioreactors. This data has a hierarchical, or nested, structure: wavelengths are measured within each day, and days are nested within each bioreactor. The key characteristic of a nested structure is that data within a nesting are more related than data between nestings. For example, the intensities across wavelengths within a day are more related to each other than to intensities from different days, AND measurements within a bioreactor are more correlated with each other than with measurements from other bioreactors.

🤯WOW this is confusing

Interrelated correlations can be visualized through a plot of autocorrelations. In the plot below, we can see that the autocorrelation between wavelengths is different on different days.

Figure 9.4: (a) Autocorrelations for selected lags of wavelengths for small-scale bioreactor 1 on the first day. (b) Autocorrelations for lags of days for average wavelength intensity for small-scale bioreactor 1.

Figure 9.4 (b) shows the autocorrelations for the first 13 lagged days. Here the correlation for the first lag is greater than 0.95, and the correlations tail off fairly quickly at longer lags.
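To make "autocorrelation at a lag" concrete, here is a minimal sketch of the usual sample autocorrelation, written in plain Python. The function name `autocorrelation` and the `daily_mean_intensity` numbers are made up for illustration; they are not the bioreactor data, so the values will not match Figure 9.4.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a sequence at a given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((v - mean) ** 2 for v in values)
    # Numerator: products of deviations for points `lag` steps apart.
    num = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return num / denom


# Hypothetical average intensity per day for one bioreactor (14 made-up values).
daily_mean_intensity = [1.00, 1.08, 1.15, 1.25, 1.31, 1.42, 1.50,
                        1.57, 1.66, 1.71, 1.80, 1.88, 1.95, 2.01]

# With 14 days there are 13 possible day lags.
acf_by_lag = {lag: autocorrelation(daily_mean_intensity, lag) for lag in range(1, 14)}
```

The same function could be applied across wavelengths within a single day to get something like panel (a), assuming the spectrum is stored as an ordered list of intensities.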

So what is the unit of prediction? Spectra? Number of days to reach a certain wavelength? Because each bioreactor is independent of the other bioreactors, the bioreactor is the experimental unit, and the unit of prediction is the day within a bioreactor. Everything WITHIN a bioreactor is going to have autocorrelation.

The experimental unit will help us decide how to use cross-validation to evaluate our results honestly. Leave-one-bioreactor-out or grouped k-fold cross-validation should be used, so that all days from a given bioreactor land in the same fold. We can’t do cross-validation on days or wavelengths because they’re not independent units.
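One way to keep correlated days out of both sides of a split is to hold out whole bioreactors at a time. The sketch below builds such splits in plain Python. The helper `leave_one_group_out`, the variable names, and the layout of one row per day (15 bioreactors x 14 days, reading "two weeks" as 14 days) are assumptions for illustration, not the book's code.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical layout: one row (spectrum) per day, grouped by bioreactor ID.
n_reactors, n_days = 15, 14
reactor_of_row = [r for r in range(1, n_reactors + 1) for _ in range(n_days)]

splits = list(leave_one_group_out(reactor_of_row))
for held_out, train, test in splits:
    # No bioreactor contributes rows to both sides of a split.
    train_reactors = {reactor_of_row[i] for i in train}
    test_reactors = {reactor_of_row[i] for i in test}
    assert train_reactors.isdisjoint(test_reactors)
```

Each split trains on 14 bioreactors and assesses on the one left out. If scikit-learn is available, `LeaveOneGroupOut` and `GroupKFold` in `sklearn.model_selection` provide the same kind of grouped splitting.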