24.4 Example: Loading IMDB data
- Load packages
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
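- The attach messages above come from loading the packages; the calls themselves are not shown in these notes, but judging from the readr mask they were along these lines:

```r
# Presumed setup (not shown in the original chunk): readr must already be
# attached for rvest to mask readr::guess_encoding
library(tidyverse)
library(rvest)
```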
- This code chunk may be a bottleneck for people with slower internet connections, and some websites guard against scraping.
# Problem directly using code from book
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
- Download the HTML file locally instead, then use read_html() directly on the downloaded file.
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
html.book <- read_html("scrapedpage.html")
- Specific to the Internet Archive: some website snapshots appear as links but are not necessarily accessible, in which case you may encounter a 403 Forbidden error.
# Not just any URL will work
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
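- One way to see what is going on, sketched here with the httr package (an assumption; these notes do not otherwise use it), is to inspect the HTTP status code before attempting the download:

```r
library(httr)

url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
resp <- HEAD(url)
status_code(resp)  # a 403 means the snapshot is linked but not actually served
```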
- But even if the download succeeded, the scraped file may be part of a dynamic website.
- A very old snapshot is available for comparison.
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet=TRUE)
- But the structure of the website and the associated HTML have changed.
- This means the code you see in the book will not work out of the box.
html.old <- read_html("scrapedpage-old.html")
# The old layout has many tables; locate the ratings table via its border="1" attribute
temp <- html.old |>
  html_elements("table") |>
  html_attr("border")
which(temp == "1")
# which() above identifies element 21 as the ratings table
temp <- html.old |>
  html_elements("table")
table.old <- html_table(temp[21], header = TRUE)
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |>
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
- Let us compare to the book.
html.book <- read_html("scrapedpage.html")
table.book <- html.book |>
  html_element("table") |>
  html_table()
ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |>
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "),
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
- Let me point out some things about the two datasets that make post-processing challenging.
- I don’t resolve them here, but resolving them requires thinking about which questions you want answered first.
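- As one sketch of the kind of check involved (assuming the ratings and ratings.book tibbles built above are in the session), a set difference on titles surfaces the spelling and encoding mismatches that any join would stumble over:

```r
library(dplyr)

# Titles present in the 2004 snapshot but absent (or spelled differently)
# in the book's snapshot; these must be reconciled before joining
setdiff(ratings$title, ratings.book$title)

# An inner join keeps only identically spelled titles; the suffixes
# distinguish the overlapping rank/year/rating columns
both <- inner_join(ratings, ratings.book, by = "title", suffix = c(".old", ".book"))
```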