24.4 Example: Loading IMDB data

  • Load packages
library(tidyverse)
library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
  • This code chunk may be a bottleneck on slower internet connections, or when the website guards against scraping.
# Problem directly using code from book
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
  • Download the HTML file locally instead, then call read_html() on the downloaded file.
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
html.book <- read_html("scrapedpage.html")
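To avoid contacting the Internet Archive every time these notes are run, the download can be skipped when a local copy already exists. A minimal sketch, where `fetch_cached()` is a hypothetical helper name (it is not part of rvest or the book):

```r
# Hypothetical helper: download `url` to `destfile` only if no local copy exists.
fetch_cached <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, quiet = TRUE)
  }
  destfile  # return the path so it can be passed straight to read_html()
}
```

With rvest loaded, it could be used as `read_html(fetch_cached(url, "scrapedpage.html"))`. Delete the local file to force a fresh download.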
  • With the Internet Archive specifically, a snapshot may be listed as a link without actually being accessible.

  • You may encounter a 403 Forbidden error.

# Not just any URL will work
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
# Save under a different name so the 2022 snapshot read later is not overwritten
download.file(url, destfile = "scrapedpage-2024.html", quiet=TRUE)
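Because a refused request stops the script with an error, it may be convenient to wrap the download so that failure is reported as `FALSE` instead. A minimal sketch, assuming a hypothetical helper named `try_download()`; it treats any warning, error, or empty file as a failure:

```r
# Hypothetical helper: attempt a download and return TRUE/FALSE
# instead of stopping when the server refuses the request (e.g. 403).
try_download <- function(url, destfile) {
  on_fail <- function(cond) {
    message("Download failed: ", conditionMessage(cond))
    FALSE
  }
  ok <- tryCatch(
    # download.file() returns 0 (invisibly) on success
    download.file(url, destfile = destfile, quiet = TRUE) == 0,
    warning = on_fail,
    error = on_fail
  )
  # An empty or missing file also counts as a failure
  ok && file.exists(destfile) && file.size(destfile) > 0
}
```

The return value can then decide whether to fall back to another snapshot URL.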
  • But even if the download succeeds, the scraped file may be part of a dynamic website, so the table may not be present in the static HTML.
  • A much older snapshot (from 2004) is available for comparison.
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet=TRUE)
  • But the structure of the website and the associated HTML have changed.
  • This means the code you see in the book will not work out of the box.
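Before running the book's code on a given snapshot, it can help to check whether the page contains a table with the expected column names at all. This is a rough heuristic, not a definitive test for dynamic pages. `has_table_with()` is a hypothetical helper, and the two toy pages are made up for illustration (they are not real IMDb markup):

```r
library(rvest)

# Hypothetical helper: TRUE if any <table> in `page` has all `expected` column names.
has_table_with <- function(page, expected) {
  tables <- page |> html_elements("table") |> html_table()
  any(vapply(tables, function(tb) all(expected %in% names(tb)), logical(1)))
}

# Toy page with a static table
static_page <- minimal_html("
  <table>
    <tr><th>Rank</th><th>Title</th></tr>
    <tr><td>1</td><td>A Film (1999)</td></tr>
  </table>")

# Toy page whose content would be filled in by JavaScript
dynamic_page <- minimal_html("<div id='root'></div><script src='app.js'></script>")

has_table_with(static_page, c("Rank", "Title"))   # TRUE
has_table_with(dynamic_page, c("Rank", "Title"))  # FALSE
```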
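# Read the local copy saved by download.file() above
html.old <- read_html("scrapedpage-old.html")
# List the border attribute of every table to locate the ratings table
temp <- html.old |> 
  html_elements("table") |> html_attr("border") 
which(temp == "1")
# Extract that table by position (the 21st <table> in this snapshot)
temp <- html.old |> 
  html_elements("table") 
table.old <- html_table(temp[21], header=TRUE)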
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |> 
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
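The positional lookup `temp[21]` above depends on the exact number of tables in that particular snapshot. An alternative sketch selects the table by its `border` attribute with a CSS attribute selector. The toy page below is made up for illustration, and whether `border="1"` uniquely identifies the ratings table in a real snapshot is an assumption that should be checked:

```r
library(rvest)

# Toy page: one layout table and one data table (not real IMDb markup)
page <- minimal_html('
  <table border="0"><tr><td>layout</td></tr></table>
  <table border="1">
    <tr><td>Rank</td><td>Title</td></tr>
    <tr><td>1</td><td>A Film (1999)</td></tr>
  </table>')

# html_element() returns the first node matching the selector
ratings_tbl <- page |>
  html_element("table[border='1']") |>
  html_table(header = TRUE)

names(ratings_tbl)
```

If several tables share the attribute, `html_elements()` would return all of them and a further filter would be needed.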
  • Let us compare to the book.
html.book <- read_html("scrapedpage.html")
table.book <- html.book |> 
  html_element("table") |> 
  html_table()
ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |> 
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "), 
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |> 
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
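To see what the first `separate_wider_regex()` call does, here is the same pattern applied to two made-up strings (the titles are invented, not taken from the scraped data). Named pattern pieces become columns; unnamed pieces are matched and dropped:

```r
library(tidyr)

toy <- tibble::tibble(
  rank_title_year = c("1. Film A (1994)", "2. Film B: Part II (1974)")
)

split <- toy |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )

split$title
```

All three new columns are character vectors, so `rank` and `year` would still need converting if numeric values are wanted.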
  • Let us point out things about the two datasets that make post-processing challenging.
  • I don’t resolve them here; resolving them requires first deciding which questions you want answered.
ratings$title
left_join(ratings, ratings.book, by="title")
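One way to inspect a join like this is to list the titles that fail to match and to name the overlapping columns explicitly. A toy sketch with made-up rows (the real `ratings` and `ratings.book` are not reproduced here):

```r
library(dplyr)

old <- tibble(title = c("Film A", "Film B"), year = c("1994", "1972"), rating = c(9.0, 8.9))
new <- tibble(title = c("Film A", "Film C"), year = c("1994", "2008"), rating = c(9.2, 9.0))

# Titles present in the old snapshot but missing from the new one
unmatched <- anti_join(old, new, by = "title")

# Shared column names get suffixes; set them instead of relying on .x / .y
joined <- left_join(old, new, by = "title", suffix = c("_old", "_new"))
names(joined)
```

Titles that differ only in spelling or punctuation between snapshots would show up in `unmatched`, which is one reason to decide on the question first before cleaning.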