4.9 Linear regression with count data - heteroscedasticity | Introduction to Statistical Learning Using R Book Club

In this example, the variance of biker numbers changes as the mean number changes:

- during worse conditions, there are few bikers and little variation in the number of bikers
- during better conditions, there are many bikers on average, but also larger variation in the number of bikers

Figure 4.14: Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers. A smoothing spline fit is shown in green. Right: The log of the number of bikers is displayed on the y-axis.
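
The mean-variance pattern described above can be reproduced with simulated counts. The sketch below is only an illustration: it assumes Poisson-distributed counts (the notes above do not name a distribution), and the two mean values and the helper `sample_poisson` are hypothetical choices, not taken from the Bikeshare data.

```python
import math
import random
import statistics

def sample_poisson(mean, rng):
    """Draw one Poisson(mean) count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical mean ridership for a "worse" hour and a "better" hour.
low_mean, high_mean = 5, 200
low = [sample_poisson(low_mean, rng) for _ in range(2000)]
high = [sample_poisson(high_mean, rng) for _ in range(2000)]

low_var = statistics.variance(low)
high_var = statistics.variance(high)
print(f"low  mean {statistics.mean(low):6.1f}  variance {low_var:7.1f}")
print(f"high mean {statistics.mean(high):6.1f}  variance {high_var:7.1f}")
```

Under this assumption the sample variance tracks the sample mean, so the busier hour shows a much larger spread. That is exactly the kind of non-constant variance a linear model does not allow for.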

Problem 2: observed heteroscedasticity is a violation of linear model assumptions

$Y = \beta_{0} + \sum_{j=1}^p \beta_{j} X_{j} + \epsilon$

where $\epsilon$ is a mean-zero error term with a constant variance

Log-transforming the response removes much of the heteroscedasticity, but the log is undefined at 0, so the transformation cannot be used where the response can take on a value of 0.
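
Both points can be checked on a toy example. The counts below are made-up values for a quiet hour and a busy hour, not Bikeshare data, and `variance_ratio` is a hypothetical helper. This is a minimal sketch, not a substitute for plotting the real data.

```python
import math
import statistics

# Made-up hourly counts for a quiet hour and a busy hour (illustration only).
quiet = [3, 5, 4, 8, 6, 2, 7, 5]
busy = [150, 260, 190, 310, 220, 120, 280, 240]

def variance_ratio(a, b):
    """Sample variance of b divided by sample variance of a."""
    return statistics.variance(b) / statistics.variance(a)

# On the raw scale the busy hour is far more variable than the quiet hour.
raw_ratio = variance_ratio(quiet, busy)

# On the log scale the two variances are of the same order of magnitude.
log_ratio = variance_ratio([math.log(y) for y in quiet],
                           [math.log(y) for y in busy])

print(f"variance ratio, raw scale: {raw_ratio:.1f}")
print(f"variance ratio, log scale: {log_ratio:.2f}")

# A count of 0 cannot be log-transformed: math.log(0) raises ValueError.
try:
    math.log(0)
    log_of_zero_ok = True
except ValueError:
    log_of_zero_ok = False
print("log(0) is defined:", log_of_zero_ok)
```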

Log transformation also results in challenges in interpretation:

e.g. “a one-unit increase in $X_j$ is associated with an increase in the mean of the log of $Y$ by an amount $\beta_j$”
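
One way to make this statement easier to read is to move back to the original scale. The short derivation below is a sketch that assumes the log-scale model $\log(Y) = \beta_0 + \sum_{j=1}^p \beta_j X_j + \epsilon$ with a mean-zero error term:

$$\mathrm{E}[\log Y \mid X_j = x + 1] - \mathrm{E}[\log Y \mid X_j = x] = \beta_j$$

$$\exp\left(\mathrm{E}[\log Y \mid X_j = x + 1]\right) = e^{\beta_j} \exp\left(\mathrm{E}[\log Y \mid X_j = x]\right)$$

Here all other predictors are held fixed. Since $\exp(\mathrm{E}[\log Y])$ is the geometric mean of $Y$, a one-unit increase in $X_j$ multiplies the geometric mean of $Y$ by $e^{\beta_j}$. For small $\beta_j$, $e^{\beta_j} \approx 1 + \beta_j$, so the change is roughly $100\,\beta_j$ percent. This is a statement about multiplicative change, not about an additive change in the expected number of bikers.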