4.5 Regression Diagnostics
4.5.1 Outliers (extreme value)
An outlier (an extreme value) is not necessarily an influential case.
Using the {performance} package by Daniel Lüdecke, we identify one influential case using Cook’s distance (other options are available).
# Fit the baseline model, then flag outliers (Cook's distance is used here)
model <- lm(Sale_Price ~ total_sf + bath + Lot_Area + Bedroom_AbvGr, data = dat)
outliers <- performance::check_outliers(model)
plot(outliers)
##    Obs Distance_Cook Outlier_Cook Outlier
## 1 1499  1.178359e+00            1       1
## 2    1  5.037865e-05            0       0
## 3    2  2.824976e-07            0       0
## 4    3  6.201489e-05            0       0
## 5    4  1.308948e-04            0       0
## 6    5  4.909392e-06            0       0
It is hard to tell whether this is a typo or a one-off sale. The property sold for $160k but has a lot area of almost 64k and over 5.6k total square feet, quite a deal in this area.
## # A tibble: 1 × 5
##   Sale_Price total_sf  bath Lot_Area Bedroom_AbvGr
##        <int>    <int> <dbl>    <int>         <int>
## 1     160000     5642   4.5    63887             3
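To make the Cook's distance flag less of a black box, here is a minimal sketch on simulated data (not the housing data) using base R's `cooks.distance()`. The names `x`, `y`, `sim_fit`, and `cooks_d`, the planted point, and the rule-of-thumb cut-off of 1 are all illustrative assumptions; `check_outliers()` may apply a different threshold.

```r
# Simulated data: a clean linear relationship plus one planted extreme case.
set.seed(42)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)

# Plant a high-leverage point that falls far from the trend (hypothetical).
x[n] <- 30
y[n] <- 10

sim_fit <- lm(y ~ x)

# Cook's distance combines residual size and leverage for each observation.
cooks_d <- cooks.distance(sim_fit)

# Rule-of-thumb cut-off (an assumption, not the {performance} default).
flagged <- which(cooks_d > 1)

# Largest distances first, similar in spirit to the output above.
head(sort(cooks_d, decreasing = TRUE))
```

The planted point is both far from the other `x` values (high leverage) and far from the fitted line (large residual), which is the combination that makes a case influential rather than merely extreme.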
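When we remove this influential case, our coefficients change noticeably: for example, the Lot_Area coefficient rises from 0.622 to 0.742 and the total_sf coefficient from 104.848 to 109.716.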
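# Refit without observation 1499 (assumes dplyr is loaded for %>%, slice(), and n())
house_noinfluence <- lm(Sale_Price ~ total_sf + bath + Lot_Area + Bedroom_AbvGr,
                        data = dat %>%
                          slice(1:1498, 1500:n()))
# Compare the coefficients of the two fits side by side
round(cbind(house_lm = model$coefficients,
            house_noinfluence = house_noinfluence$coefficients), digits = 3)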
##                 house_lm house_noinfluence
## (Intercept)    28569.669         26596.755
## total_sf         104.848           109.716
## bath           28167.404         27124.051
## Lot_Area           0.622             0.742
## Bedroom_AbvGr -25683.007        -27089.435
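To put a number on how much the estimates move, the relative change in each coefficient can be computed from the rounded values printed above. This is only a small sketch: `pct_change` is an illustrative name, and the inputs are copied by hand from the table rather than taken from the fitted model objects.

```r
# Coefficients copied from the comparison table above (rounded to 3 digits).
house_lm <- c(`(Intercept)` = 28569.669, total_sf = 104.848,
              bath = 28167.404, Lot_Area = 0.622,
              Bedroom_AbvGr = -25683.007)
house_noinfluence <- c(`(Intercept)` = 26596.755, total_sf = 109.716,
                       bath = 27124.051, Lot_Area = 0.742,
                       Bedroom_AbvGr = -27089.435)

# Percent change relative to the size of the original estimate.
pct_change <- 100 * (house_noinfluence - house_lm) / abs(house_lm)
round(pct_change, 1)
```

On these values, Lot_Area moves the most in relative terms (about 19%), while the other coefficients change by roughly 4 to 7%.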