24.4 Example: Loading IMDB data
- Load packages
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
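- The attach messages above come from loading the packages; the calls themselves are not shown in these notes, but judging from the readr mask they were along these lines:

```r
# Presumed setup (not shown in the original chunk): readr must already be
# attached for rvest to mask readr::guess_encoding
library(tidyverse)
library(rvest)
```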
- This code chunk may be a bottleneck for people with slower internet connections, and some websites guard against scraping.
# Problem directly using code from book
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
- Download the HTML file locally instead, then use read_html() directly on the downloaded file.
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
html.book <- read_html("scrapedpage.html")
- Specific to the Internet Archive: some website snapshots appear as links but are not necessarily accessible, in which case you may encounter a 403 Forbidden error.
# Not just any URL will work
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
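- One way to see what is going on, sketched here with the httr package (an assumption; these notes do not otherwise use it), is to inspect the HTTP status code before attempting the download:

```r
library(httr)

url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
resp <- HEAD(url)
status_code(resp)  # a 403 means the snapshot is linked but not actually served
```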
- But even if the download succeeded, the scraped file may be part of a dynamic website.
- A very old snapshot is available for comparison.
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet=TRUE)
- But the structure of the website and the associated HTML have changed.
- This means the code you see in the book will not work out of the box.
html.old <- read_html("scrapedpage-old.html")
# The old layout has many tables; locate the ratings table via its border="1" attribute
temp <- html.old |>
  html_elements("table") |>
  html_attr("border")
which(temp == "1")
# which() above identifies element 21 as the ratings table
temp <- html.old |>
  html_elements("table")
table.old <- html_table(temp[21], header = TRUE)
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |>
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
- Let us compare to the book.
html.book <- read_html("scrapedpage.html")
table.book <- html.book |>
  html_element("table") |>
  html_table()
ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |>
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "),
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
- Let me point out some things about the two datasets that make post-processing challenging.
- I don’t resolve them here, but resolving them requires thinking about which questions you want answered first.
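- As one sketch of the kind of check involved (assuming the ratings and ratings.book tibbles built above are in the session), a set difference on titles surfaces the spelling and encoding mismatches that any join would stumble over:

```r
library(dplyr)

# Titles present in the 2004 snapshot but absent (or spelled differently)
# in the book's snapshot; these must be reconciled before joining
setdiff(ratings$title, ratings.book$title)

# An inner join keeps only identically spelled titles; the suffixes
# distinguish the overlapping rank/year/rating columns
both <- inner_join(ratings, ratings.book, by = "title", suffix = c(".old", ".book"))
```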