3.3 Dealing with missingness
Diagnose missing values in the Ames housing dataset
ames_raw <- AmesHousing::ames_raw
sum(is.na(ames_raw))
## [1] 13997
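The single total above does not say which columns carry the missing values. A minimal base R sketch of a per-column breakdown follows; it uses a small made-up data frame (`toy`, with invented values) standing in for `ames_raw`, so the numbers are illustrative only.

```r
# Toy data frame standing in for ames_raw (values are invented for illustration)
toy <- data.frame(
  Alley          = c(NA, "Grvl", NA, NA),
  `Lot Frontage` = c(80, NA, 65, 70),
  `Lot Area`     = c(9600, 11250, 8400, 10000),
  check.names = FALSE
)

# Overall count of missing cells, as in sum(is.na(ames_raw))
total_missing <- sum(is.na(toy))

# Count per column, keeping only columns with at least one NA, largest first
missing_by_col <- colSums(is.na(toy))
missing_by_col <- sort(missing_by_col[missing_by_col > 0], decreasing = TRUE)

# Share of rows missing in each of those columns
missing_rate <- missing_by_col / nrow(toy)
```

Replacing `toy` with `ames_raw` should give the same kind of summary that the skimr output reports in its `n_missing` column.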
library(skimr)
library(dplyr) # provides filter() and %>%
# custom skim function to remove some of the quartile data
my_skim <- skim_with(numeric = sfl(p25 = NULL, p75 = NULL))
my_skim(ames_raw) %>%
  filter(n_missing > 0)
| Name | ames_raw |
| Number of rows | 2930 |
| Number of columns | 82 |
| _______________________ | |
| Column type frequency: | |
| character | 16 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Alley | 2732 | 0.07 | 4 | 4 | 0 | 2 | 0 |
| Mas Vnr Type | 23 | 0.99 | 4 | 7 | 0 | 5 | 0 |
| Bsmt Qual | 80 | 0.97 | 2 | 2 | 0 | 5 | 0 |
| Bsmt Cond | 80 | 0.97 | 2 | 2 | 0 | 5 | 0 |
| Bsmt Exposure | 83 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtFin Type 1 | 80 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| BsmtFin Type 2 | 81 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| Fireplace Qu | 1422 | 0.51 | 2 | 2 | 0 | 5 | 0 |
| Garage Type | 157 | 0.95 | 6 | 7 | 0 | 6 | 0 |
| Garage Finish | 159 | 0.95 | 3 | 3 | 0 | 3 | 0 |
| Garage Qual | 159 | 0.95 | 2 | 2 | 0 | 5 | 0 |
| Garage Cond | 159 | 0.95 | 2 | 2 | 0 | 5 | 0 |
| Pool QC | 2917 | 0.00 | 2 | 2 | 0 | 4 | 0 |
| Fence | 2358 | 0.20 | 4 | 5 | 0 | 4 | 0 |
| Misc Feature | 2824 | 0.04 | 4 | 4 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|---|---|
| Lot Frontage | 490 | 0.83 | 69.22 | 23.37 | 21 | 68 | 313 | ▇▃▁▁▁ |
| Mas Vnr Area | 23 | 0.99 | 101.90 | 179.11 | 0 | 0 | 1600 | ▇▁▁▁▁ |
| BsmtFin SF 1 | 1 | 1.00 | 442.63 | 455.59 | 0 | 370 | 5644 | ▇▁▁▁▁ |
| BsmtFin SF 2 | 1 | 1.00 | 49.72 | 169.17 | 0 | 0 | 1526 | ▇▁▁▁▁ |
| Bsmt Unf SF | 1 | 1.00 | 559.26 | 439.49 | 0 | 466 | 2336 | ▇▅▂▁▁ |
| Total Bsmt SF | 1 | 1.00 | 1051.61 | 440.62 | 0 | 990 | 6110 | ▇▃▁▁▁ |
| Bsmt Full Bath | 2 | 1.00 | 0.43 | 0.52 | 0 | 0 | 3 | ▇▆▁▁▁ |
| Bsmt Half Bath | 2 | 1.00 | 0.06 | 0.25 | 0 | 0 | 2 | ▇▁▁▁▁ |
| Garage Yr Blt | 159 | 0.95 | 1978.13 | 25.53 | 1895 | 1979 | 2207 | ▂▇▁▁▁ |
| Garage Cars | 1 | 1.00 | 1.77 | 0.76 | 0 | 2 | 5 | ▅▇▂▁▁ |
| Garage Area | 1 | 1.00 | 472.82 | 215.05 | 0 | 480 | 1488 | ▃▇▃▁▁ |
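Several garage columns in the tables above report the same missing count (159), which suggests, but does not prove, that the same rows are missing in each. One way to check this is to count how often two columns are missing together. The sketch below does so in base R on a made-up data frame (`toy`, invented values); the column names mimic `ames_raw` for readability only.

```r
# Toy data frame standing in for ames_raw (values are invented for illustration)
toy <- data.frame(
  `Garage Qual`  = c("TA", NA, "Fa", NA, "TA"),
  `Garage Cond`  = c("TA", NA, "TA", NA, "TA"),
  `Lot Frontage` = c(60, 70, NA, 80, NA),
  check.names = FALSE
)

# Logical matrix: TRUE where a cell is missing
na_mat <- is.na(toy)

# Entry [i, j] counts the rows where columns i and j are both missing;
# the diagonal holds each column's own missing count
co_missing <- crossprod(na_mat * 1)

# TRUE when two columns are missing in exactly the same rows
same_rows <- identical(na_mat[, "Garage Qual"], na_mat[, "Garage Cond"])
```

When an off-diagonal entry equals both matching diagonal entries, the two columns are missing in the same rows, which usually points to a shared cause rather than independent gaps.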
Visualize missing values
library(visdat) # provides vis_miss()
vis_miss(ames_raw, sort_miss = TRUE)
Another way to visualize missing values
library(DataExplorer)
ames_raw %>%
plot_missing(missing_only = TRUE)
What causes missing values?
Error in data entry
Values were never recorded
Respondent didn’t provide a numeric/categorical input (common in surveys)
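How a missing value should be handled depends on which of these causes applies. As one illustration, the sketch below assumes that an `NA` in a column such as `Alley` was never recorded because the feature is absent from the property; that assumption should be checked against the data documentation before use. Under it, the `NA` can be recoded to an explicit level. The helper `recode_absent()` and the label `"None"` are hypothetical names introduced here, and `toy` holds invented values.

```r
# Toy data frame standing in for ames_raw (values are invented for illustration)
toy <- data.frame(
  Alley          = c(NA, "Grvl", NA, "Pave"),
  `Lot Frontage` = c(80, NA, 65, 70),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA in a character vector with an explicit label.
# Assumption: NA here means "feature not present", not "value unknown".
recode_absent <- function(x, label = "None") {
  x[is.na(x)] <- label
  x
}

toy$Alley <- recode_absent(toy$Alley)

# Missing values that remain after recoding (Lot Frontage is left untouched)
remaining <- colSums(is.na(toy))
```

Columns whose gaps come from data entry errors or non-response, such as `Lot Frontage` in this toy example, are left as `NA` here because recoding them to a label would hide a genuinely unknown value.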