10.1 Mapping with non-spatial regression and ML models

Let’s start with loading the North Carolina SIDS data - nc spatial data.

library(tidyverse)
library(sf)
nc <- system.file("gpkg/nc.gpkg", package="sf") |>
  read_sf() 

It contains information about the 100 counties of North Carolina, including counts of live births (total and non-white) and counts of sudden infant deaths, for the periods July 1, 1974 to June 30, 1978 and July 1, 1979 to June 30, 1984. We are interested in:

  • Sudden Infant Death Syndrome, 1974-78 (SIDS)
  • Live births, 1974-78 (BIR74)
  • Non-white births, 1974-78 (NWBIR74)

nc %>% names
##  [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"      "FIPS"     
##  [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
## [13] "SID79"     "NWBIR79"   "geom"
nc %>%
  dplyr::select(SID74, BIR74, NWBIR74) %>%
  head()
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## Geodetic CRS:  NAD27
## # A tibble: 6 × 4
##   SID74 BIR74 NWBIR74                                                       geom
##   <dbl> <dbl>   <dbl>                                         <MULTIPOLYGON [°]>
## 1     1  1091      10 (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.2…
## 2     0   487      10 (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.4…
## 3     5  3188     208 (((-80.45634 36.24256, -80.47639 36.25473, -80.53688 36.2…
## 4     1   508     123 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.33…
## 5     9  1421    1066 (((-77.21767 36.24098, -77.23461 36.2146, -77.29861 36.21…
## 6     7  1452     954 (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.2…

The data can be transformed as needed for the model analysis; here we compute the SIDS rate and the proportion of non-white births per county.

nc1 <- nc |> mutate(SID = SID74 / BIR74, NWB = NWBIR74 / BIR74)

nc1 %>% dplyr::select(SID, NWB) %>% head()
## Simple feature collection with 6 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## Geodetic CRS:  NAD27
## # A tibble: 6 × 3
##        SID     NWB                                                          geom
##      <dbl>   <dbl>                                            <MULTIPOLYGON [°]>
## 1 0.000917 0.00917 (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.2735…
## 2 0        0.0205  (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.4050…
## 3 0.00157  0.0652  (((-80.45634 36.24256, -80.47639 36.25473, -80.53688 36.2567…
## 4 0.00197  0.242   (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.33598…
## 5 0.00633  0.750   (((-77.21767 36.24098, -77.23461 36.2146, -77.29861 36.21153…
## 6 0.00482  0.657   (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.2355…

Building a model with spatial data works the same way as with ordinary data frames. A regression model decomposes the observed values into an explained part and a remainder:

\[\text{observed = explainer + remainder}\]

\[Y= (\beta_0+\beta_1x) +\epsilon\]

The first step is to apply a linear regression model to our spatial data.

fit <- lm(SID ~ NWB, nc1)
broom::tidy(fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 0.000677  0.000233      2.91 4.47e- 3
## 2 NWB         0.00438   0.000620      7.06 2.44e-10

Calling predict() with interval = "prediction" returns the fit, lwr, and upr vectors: the fitted values together with the lower and upper bounds of the prediction interval.

pr <- fit %>%
  predict(nc1, interval = "prediction")

bind_cols(nc, pr) |> names()
##  [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"      "FIPS"     
##  [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
## [13] "SID79"     "NWBIR79"   "geom"      "fit"       "lwr"       "upr"
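The predictions can be mapped back onto the county geometries. A minimal sketch (the object name nc_pred and the viridis fill scale are choices made here, not taken from the steps above):

```r
library(tidyverse)
library(sf)

# reload the data and refit the model from the steps above
nc <- system.file("gpkg/nc.gpkg", package = "sf") |> read_sf()
nc1 <- nc |> mutate(SID = SID74 / BIR74, NWB = NWBIR74 / BIR74)
fit <- lm(SID ~ NWB, nc1)
pr <- predict(fit, nc1, interval = "prediction")

# attach fitted values and prediction bounds to the geometries
nc_pred <- bind_cols(nc1, as_tibble(pr))

# choropleth of the fitted SIDS rate per county
ggplot(nc_pred) +
  geom_sf(aes(fill = fit)) +
  scale_fill_viridis_c(name = "Predicted\nSID rate")
```

The lwr and upr columns can be mapped the same way to show where the model is least certain.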

Cross-validation techniques must take spatial autocorrelation into account: nearby locations, or locations with a similar structure, tend to have similar values. Because of this, applying classical cross-validation to spatial data can lead to over-optimistic performance estimates. To overcome this, more specialized cross-validation techniques have been developed for spatial data, such as: