3.3 Dealing with missingness
Diagnose missing values in the Ames housing dataset
ames_raw <- AmesHousing::ames_raw
sum(is.na(ames_raw))
## [1] 13997
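The total above does not show which columns the missing values come from. As a minimal sketch, per-column counts can be obtained with base R. The small data frame `toy` below is a hypothetical stand-in for `ames_raw`, with made-up values.

```r
# Hypothetical toy data standing in for ames_raw (values are made up)
toy <- data.frame(
  alley        = c(NA, "Grvl", NA, NA),
  lot_frontage = c(80, NA, 68, 60),
  sale_price   = c(215000, 105000, 172000, 244000)
)

# Total number of missing cells, analogous to sum(is.na(ames_raw))
total_missing <- sum(is.na(toy))

# Missing cells per column, largest first
missing_by_col <- sort(colSums(is.na(toy)), decreasing = TRUE)
missing_by_col
```

The same two calls should work unchanged on `ames_raw` once it is loaded.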
library(skimr)
library(dplyr)  # provides %>% and filter(), if not already loaded
# custom skim function that drops the p25 and p75 quartile columns
my_skim <- skim_with(numeric = sfl(p25 = NULL, p75 = NULL))
my_skim(ames_raw) %>%
  filter(n_missing > 0)
Name | ames_raw |
Number of rows | 2930 |
Number of columns | 82 |
_______________________ | |
Column type frequency: | |
character | 16 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Alley | 2732 | 0.07 | 4 | 4 | 0 | 2 | 0 |
Mas Vnr Type | 23 | 0.99 | 4 | 7 | 0 | 5 | 0 |
Bsmt Qual | 80 | 0.97 | 2 | 2 | 0 | 5 | 0 |
Bsmt Cond | 80 | 0.97 | 2 | 2 | 0 | 5 | 0 |
Bsmt Exposure | 83 | 0.97 | 2 | 2 | 0 | 4 | 0 |
BsmtFin Type 1 | 80 | 0.97 | 3 | 3 | 0 | 6 | 0 |
BsmtFin Type 2 | 81 | 0.97 | 3 | 3 | 0 | 6 | 0 |
Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
Fireplace Qu | 1422 | 0.51 | 2 | 2 | 0 | 5 | 0 |
Garage Type | 157 | 0.95 | 6 | 7 | 0 | 6 | 0 |
Garage Finish | 159 | 0.95 | 3 | 3 | 0 | 3 | 0 |
Garage Qual | 159 | 0.95 | 2 | 2 | 0 | 5 | 0 |
Garage Cond | 159 | 0.95 | 2 | 2 | 0 | 5 | 0 |
Pool QC | 2917 | 0.00 | 2 | 2 | 0 | 4 | 0 |
Fence | 2358 | 0.20 | 4 | 5 | 0 | 4 | 0 |
Misc Feature | 2824 | 0.04 | 4 | 4 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p50 | p100 | hist |
---|---|---|---|---|---|---|---|---|
Lot Frontage | 490 | 0.83 | 69.22 | 23.37 | 21 | 68 | 313 | ▇▃▁▁▁ |
Mas Vnr Area | 23 | 0.99 | 101.90 | 179.11 | 0 | 0 | 1600 | ▇▁▁▁▁ |
BsmtFin SF 1 | 1 | 1.00 | 442.63 | 455.59 | 0 | 370 | 5644 | ▇▁▁▁▁ |
BsmtFin SF 2 | 1 | 1.00 | 49.72 | 169.17 | 0 | 0 | 1526 | ▇▁▁▁▁ |
Bsmt Unf SF | 1 | 1.00 | 559.26 | 439.49 | 0 | 466 | 2336 | ▇▅▂▁▁ |
Total Bsmt SF | 1 | 1.00 | 1051.61 | 440.62 | 0 | 990 | 6110 | ▇▃▁▁▁ |
Bsmt Full Bath | 2 | 1.00 | 0.43 | 0.52 | 0 | 0 | 3 | ▇▆▁▁▁ |
Bsmt Half Bath | 2 | 1.00 | 0.06 | 0.25 | 0 | 0 | 2 | ▇▁▁▁▁ |
Garage Yr Blt | 159 | 0.95 | 1978.13 | 25.53 | 1895 | 1979 | 2207 | ▂▇▁▁▁ |
Garage Cars | 1 | 1.00 | 1.77 | 0.76 | 0 | 2 | 5 | ▅▇▂▁▁ |
Garage Area | 1 | 1.00 | 472.82 | 215.05 | 0 | 480 | 1488 | ▃▇▃▁▁ |
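The `complete_rate` column is the share of non-missing rows, that is, 1 - n_missing / number of rows. As a quick check, the sketch below recomputes it for three rows of the output above, using the counts exactly as shown there.

```r
n_rows <- 2930  # number of rows reported in the skim summary

# n_missing values copied from the skim output above
n_missing <- c("Alley" = 2732, "Pool QC" = 2917, "Lot Frontage" = 490)

# Share of rows with a recorded value, rounded as in the skim display
complete_rate <- round(1 - n_missing / n_rows, 2)
complete_rate
```

The results agree with the 0.07, 0.00, and 0.83 reported for these three variables.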
Visualize missing values
library(visdat)  # provides vis_miss()
vis_miss(ames_raw, sort_miss = TRUE)
Another way to visualize missing values
library(DataExplorer)
ames_raw %>%
  plot_missing(missing_only = TRUE)
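Plots like these also help spot columns that tend to be missing together (in the skim output, several garage columns share the same count of 159). As a rough sketch of how this could be checked numerically, the cross-product of the missingness indicator matrix counts the rows in which two columns are both missing. The data frame `toy` is hypothetical.

```r
# Hypothetical toy data; the two garage columns are missing in the same rows
toy <- data.frame(
  garage_qual  = c("TA", NA, "TA", NA, "Fa"),
  garage_cond  = c("TA", NA, "TA", NA, "TA"),
  lot_frontage = c(80, 70, NA, NA, 60)
)

# 1 where a cell is missing, 0 otherwise
na_mat <- is.na(toy) * 1

# Entry [i, j] = number of rows where columns i and j are both missing;
# the diagonal holds each column's own missing count
co_missing <- crossprod(na_mat)
co_missing
```

An off-diagonal entry equal to the diagonal entries of both columns suggests the two columns are missing in exactly the same rows.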
What causes missing values?
Error in data entry
Values were never recorded
Respondent didn’t provide a numeric/categorical input (common in surveys)
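Data-entry errors sometimes show up as implausible values rather than as `NA`. In the skim output above, for instance, `Garage Yr Blt` has a maximum of 2207. One possible way to handle such cases, sketched below on a hypothetical vector, is to recode values outside a plausible range as missing. The cutoff `max_year` is an assumption chosen for illustration, not a value taken from the dataset documentation.

```r
# Hypothetical garage build years; 2207 mimics the suspicious maximum above
garage_yr <- c(1960, 1979, 2207, NA, 2003)

max_year <- 2010  # assumed upper bound, for illustration only

# Recode impossible years as missing; values that are already NA stay NA
garage_yr_clean <- ifelse(!is.na(garage_yr) & garage_yr > max_year,
                          NA, garage_yr)
garage_yr_clean
```

Whether to recode, correct, or drop such values depends on the analysis, so this is only one option.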