3.3 Dealing with missingness

Diagnose missing values in the Ames housing dataset

ames_raw <- AmesHousing::ames_raw

sum(is.na(ames_raw))
## [1] 13997
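The single total above does not show where the gaps are. As a minimal base R sketch, the counts can be broken down by column and by row. The `toy` data frame is a hypothetical stand-in whose column names only mimic the style of `ames_raw`; swapping `ames_raw` in for `toy` should give the real breakdown.

```r
# Hypothetical toy data standing in for ames_raw (values are made up)
toy <- data.frame(
  `Lot Frontage` = c(65, NA, 80, NA),
  `Alley`        = c(NA, NA, "Grvl", NA),
  `Garage Cars`  = c(2, 1, 2, 0),
  check.names = FALSE  # keep the spaces in the column names
)

# Missing values per column, largest first
na_per_col <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_per_col

# Share of each column that is missing
round(na_per_col / nrow(toy), 2)

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na
```

`colSums(is.na(x))` and `complete.cases(x)` are both base R, so this check needs no extra packages.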
library(dplyr)  # for %>% and filter(); skip if already attached
library(skimr)

# custom skim function that drops the 25th and 75th percentiles from the numeric summary
my_skim <- skim_with(numeric = sfl(p25 = NULL, p75 = NULL))

my_skim(ames_raw) %>%
     filter(n_missing > 0)
Table 3.1: Data summary
Name ames_raw
Number of rows 2930
Number of columns 82
_______________________
Column type frequency:
character 16
numeric 11
________________________
Group variables None

Variable type: character

skim_variable     n_missing  complete_rate  min  max  empty  n_unique  whitespace
Alley                  2732           0.07    4    4      0         2           0
Mas Vnr Type             23           0.99    4    7      0         5           0
Bsmt Qual                80           0.97    2    2      0         5           0
Bsmt Cond                80           0.97    2    2      0         5           0
Bsmt Exposure            83           0.97    2    2      0         4           0
BsmtFin Type 1           80           0.97    3    3      0         6           0
BsmtFin Type 2           81           0.97    3    3      0         6           0
Electrical                1           1.00    3    5      0         5           0
Fireplace Qu           1422           0.51    2    2      0         5           0
Garage Type             157           0.95    6    7      0         6           0
Garage Finish           159           0.95    3    3      0         3           0
Garage Qual             159           0.95    2    2      0         5           0
Garage Cond             159           0.95    2    2      0         5           0
Pool QC                2917           0.00    2    2      0         4           0
Fence                  2358           0.20    4    5      0         4           0
Misc Feature           2824           0.04    4    4      0         5           0

Variable type: numeric

skim_variable     n_missing  complete_rate     mean      sd    p0   p50  p100  hist
Lot Frontage            490           0.83    69.22   23.37    21    68   313  ▇▃▁▁▁
Mas Vnr Area             23           0.99   101.90  179.11     0     0  1600  ▇▁▁▁▁
BsmtFin SF 1              1           1.00   442.63  455.59     0   370  5644  ▇▁▁▁▁
BsmtFin SF 2              1           1.00    49.72  169.17     0     0  1526  ▇▁▁▁▁
Bsmt Unf SF               1           1.00   559.26  439.49     0   466  2336  ▇▅▂▁▁
Total Bsmt SF             1           1.00  1051.61  440.62     0   990  6110  ▇▃▁▁▁
Bsmt Full Bath            2           1.00     0.43    0.52     0     0     3  ▇▆▁▁▁
Bsmt Half Bath            2           1.00     0.06    0.25     0     0     2  ▇▁▁▁▁
Garage Yr Blt           159           0.95  1978.13   25.53  1895  1979  2207  ▂▇▁▁▁
Garage Cars               1           1.00     1.77    0.76     0     2     5  ▅▇▂▁▁
Garage Area               1           1.00   472.82  215.05     0   480  1488  ▃▇▃▁▁
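
In the summary above, Garage Finish, Garage Qual, Garage Cond and Garage Yr Blt each report 159 missing values, which suggests (but does not prove) that the gaps fall in the same rows. One way to check is to count, per row, how many of the related columns are missing. The sketch below uses a hypothetical `toy` data frame with made-up values; with `ames_raw`, the same code would take the real garage column names in `garage_cols`.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  `Garage Finish` = c("Fin", NA, "Unf", NA, "RFn"),
  `Garage Qual`   = c("TA",  NA, "TA",  NA, "Gd"),
  `Garage Yr Blt` = c(1990,  NA, 2001,  NA, NA),
  check.names = FALSE  # keep the spaces in the column names
)

garage_cols <- c("Garage Finish", "Garage Qual", "Garage Yr Blt")

# Number of garage columns that are missing in each row
n_missing_by_row <- rowSums(is.na(toy[garage_cols]))

# Rows at 0 have no gaps; rows at length(garage_cols) are missing every column
table(n_missing_by_row)

# TRUE only if every row is either fully observed or fully missing
jointly_missing <- all(n_missing_by_row %in% c(0, length(garage_cols)))
jointly_missing
```

If `jointly_missing` is `FALSE`, at least one row is missing only some of the related columns, and those rows are worth inspecting individually.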

Visualize missing values

library(visdat)  # vis_miss() comes from visdat; skip if already attached
vis_miss(ames_raw, sort_miss = TRUE)

Another way to visualize missing values

library(DataExplorer)

ames_raw %>% 
     plot_missing(missing_only = TRUE)

What causes missing values?

  • Error in data entry

  • Values were never recorded

  • Respondent didn’t provide a numeric/categorical input (common in surveys)