29.14 gapminder
Where do I begin to express the awesomness of Hans Rosling (27 July 1948 – 7 February 2017). I HIGHLY encourage you to watch the youtube video!
Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four
Furthermore, we can thank Jenny Bryan for authoring the gapminder package!
To gain some insight to the data, we ask the question: “How does life expectancy (lifeExp) change over time (year) for each country (country)?”
At first glance, it appears life expectancy is increasing…but not for all countries.
To make it easier to view, we fit a model with a linear trend. The model captures steady growth over time, and the residuals will show what’s left.
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
ggtitle("Full data = ")
nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>%
add_predictions(nz_mod) %>%
ggplot(aes(year, pred)) +
geom_line() +
ggtitle("Linear trend + ")
nz %>%
add_residuals(nz_mod) %>%
ggplot(aes(year, resid)) +
geom_hline(yintercept = 0, colour = "white", linewidth = 3) +
geom_line() +
ggtitle("Remaining pattern")
This is good….but how do we do this for every country?
29.14.1 Nested data
To make life easier for us (and ensure we are not copying and pasting forever), we use the map
function from the purrr
package.
Here, we are going to make a nested data frame.
This creates a data frame that has one row per group (per country), and a rather unusual column:data
. data
is a list of data frames (or tibbles, to be precise).
Note: Don’t use the Structure function
str()
as it will be difficult to view. Instead, just view a single line of your nested dataframe.
Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group.
29.14.2 List-columns
Now, lets fit some models!
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
models <- map(by_country$data, country_model)
Yet, this may be too costly, so instead, lets modify (be more elegant) with our data. Storing related objects in columns is a key part of the value of data frames.
This is a huge advantage: because all the related objects are stored together, you don’t need to manually keep them in sync when you filter or arrange.
If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
29.14.3 Unnesting
Previously, we were working with only one country. Now, lets compute the residuals of ALL countries.
By now, you should ask yourself: “How can I plot a bunch of dataframes?”. This isn’t required. We have to un-nest.
Now, we can plot ALL the residuals.
resids %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
We can now plot each country as a facet.
resids %>%
ggplot(aes(year, resid, group = country)) +
geom_line(alpha = 1 / 3) +
facet_wrap(~continent)
We can note that we still have very large residuals, namely in Africa suggesting our model isn’t fitting very well.
29.14.4 Model quality
Instead of looking at the residuals from the model, we could look at some general measurements of model quality.
We’ll use broom::glance()
to extract some model quality metrics.
We can use mutate() and unnest() to create a data frame with a row for each country.
This isn’t quite the output we want, because it still includes all the list columns. This is default behaviour when unnest()
works on single row data frames. To suppress these columns we use .drop = TRUE
Note,
.drop=TRUE
has been depricated.
glance <- by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE)
glance
Now, we can look for countries that don’t fit our model well.
The worst fitting models seem to be in Africa. We’ll add geom_jitter()
to make it more apparent.
We can also plot the particular bad \[R^2\] and plot the data.
glance %>%
ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5)
bad_fit <- filter(glance, r.squared < 0.25)
gapminder %>%
semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
The relation of this visual is the tragidies of HIV/AIDs epidemic and the Rwanda genocide.