How can I get data from web pages?

Learning objectives:

  • Determine whether a website allows you to scrape data from it.
  • Determine whether you should scrape data from a web page.
  • Scrape tables from web pages.
  • Scrape more complex data structures from web pages.

(potential chapter break)

  • Scrape data from websites that require you to log in.
  • Scrape content that requires interaction.
  • Automate web scraping processes.
  • Scrape data as part of a workflow.

Should I scrape this data?

Can I legally scrape this data?

  • Personal use or nonprofit educational use is usually OK
  • Be careful with personally identifiable information (PII)
  • Check the site’s legal disclaimers (but they may be over-protective)
  • US:
    • Can’t copyright facts
    • CAN copyright collections of facts in some cases (when the selection or arrangement is creative)
  • Other places:
    • Sometimes stricter (e.g., EU)
    • Sometimes more lax

Should I scrape this data?

robots.txt

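  • User-agent: * = rules that apply to everybody
  • Search for the name(s) of the package(s) you plan to use (they may be listed as a User-agent)
  • Search for the specific pages you want to scrape
  • Check the rules for the root of the site (/) and for your particular subfolder
  • These rules aren’t (necessarily) legally binding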
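
The checks above can be sketched in base R. This is a simplified illustration, not a full robots.txt parser: the file contents are hypothetical, the helper names `disallowed_for()` and `is_disallowed()` are made up for this example, and the sketch ignores Allow rules and wildcards. The {robotstxt} package is likely a better choice for real work.

```r
# Hypothetical robots.txt contents; a real file lives at <site root>/robots.txt.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search",
  "",
  "User-agent: rvest",
  "Disallow: /"
)

# Return the Disallow paths listed under a given User-agent.
# Simplified: assumes one User-agent line per group and ignores
# Allow rules and wildcards.
disallowed_for <- function(lines, agent) {
  lines <- trimws(lines)
  is_agent <- grepl("^user-agent:", lines, ignore.case = TRUE)
  group <- cumsum(is_agent)  # group id for every line
  agents <- trimws(sub("^[^:]*:", "", lines[is_agent]))
  wanted <- which(tolower(agents) == tolower(agent))
  is_rule <- grepl("^disallow:", lines, ignore.case = TRUE)
  paths <- trimws(sub("^[^:]*:", "", lines[group %in% wanted & is_rule]))
  paths[nzchar(paths)]
}

# TRUE if `path` starts with any disallowed prefix.
is_disallowed <- function(path, disallowed) {
  any(startsWith(path, disallowed))
}

everybody <- disallowed_for(robots_txt, "*")
is_disallowed("/private/data.html", everybody)  # under a disallowed subfolder
is_disallowed("/tables/2023.html", everybody)   # not covered by any rule
disallowed_for(robots_txt, "rvest")             # rules aimed at one package
```

In this made-up file, the rules for everybody block two prefixes, while the group naming rvest blocks the whole site, which is why it is worth searching for package names as well as `*`.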

Do I need to scrape this data?

  • Try {datapasta} 📦
    • RStudio Addins
  • If it’s a one-time task and scraping would be over-complicated, consider other copy/paste strategies
  • Only scrape what you need
  • Look for an API!
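
To illustrate why an API is worth looking for: when a site offers one, the data often arrives as JSON that converts straight to a data frame, with no HTML parsing at all. The sketch below assumes the {jsonlite} package and uses a hypothetical JSON string in place of a real API response.

```r
library(jsonlite)

# Hypothetical JSON, standing in for the body of an API response.
api_response <- '[
  {"station": "A", "temp_c": 21.5},
  {"station": "B", "temp_c": 19.0}
]'

# fromJSON() simplifies an array of records into a data frame.
readings <- fromJSON(api_response)
readings
```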

How can I scrape tables of data?

Example 1: Single table

(screenshot of table on page)

How can I scrape a single table?

(code demo of scraping a table)
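
A minimal sketch of such a demo, assuming the {rvest} package. The page is a hypothetical one built inline with `minimal_html()` so the example runs offline; for a real page you would start from `read_html()` with the page's URL.

```r
library(rvest)

# Hypothetical page, built inline so the example runs offline.
page <- minimal_html('
  <table>
    <tr><th>name</th><th>height</th></tr>
    <tr><td>Ann</td><td>170</td></tr>
    <tr><td>Bo</td><td>182</td></tr>
  </table>
')

heights <- page |>
  html_element("table") |>  # first element matching the selector
  html_table()              # parse it into a tibble

heights
```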

Example 2: Multiple tables

(screenshot of tables on page)

How can I choose a table?

(code demo of scraping one of many tables)
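
One way this demo might look, again assuming {rvest} and a hypothetical inline page. The table ids `teams` and `players` are invented for the example; on a real page you would inspect the HTML to find a usable id, class, or position.

```r
library(rvest)

# Hypothetical page with two tables.
page <- minimal_html('
  <table id="teams">
    <tr><th>team</th><th>wins</th></tr>
    <tr><td>Red</td><td>3</td></tr>
  </table>
  <table id="players">
    <tr><th>player</th><th>goals</th></tr>
    <tr><td>Kim</td><td>5</td></tr>
    <tr><td>Lee</td><td>2</td></tr>
  </table>
')

# Option 1: pick a table by position.
tables <- html_elements(page, "table")
by_position <- html_table(tables[[2]])

# Option 2: pick a table by a CSS selector (here, its id).
by_id <- page |>
  html_element("#players") |>
  html_table()
```

Selecting by id or class tends to be more robust than selecting by position, since the position changes whenever the site adds or removes a table.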

How can I scrape multiple tables?

(purrr)
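
A sketch of the purrr approach, assuming {rvest} and {purrr}: iterate over a vector of (hypothetical) table ids and scrape each one into a named list. The inline page and ids are made up for the example.

```r
library(rvest)
library(purrr)

# Hypothetical page with two tables.
page <- minimal_html('
  <table id="teams">
    <tr><th>team</th><th>wins</th></tr>
    <tr><td>Red</td><td>3</td></tr>
  </table>
  <table id="players">
    <tr><th>player</th><th>goals</th></tr>
    <tr><td>Kim</td><td>5</td></tr>
    <tr><td>Lee</td><td>2</td></tr>
  </table>
')

ids <- c("teams", "players")

# Name the vector by itself so the result is a named list,
# then scrape one table per id.
all_tables <- ids |>
  set_names() |>
  map(\(id) html_table(html_element(page, paste0("#", id))))

all_tables$players
```

If you simply want every table on the page, `html_table()` can also be applied to the result of `html_elements(page, "table")`, which returns a list of tibbles. The `map()` pattern becomes more useful when each table needs its own selector, or when the tables live on several pages.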

How can I scrape more complex data?

What is SelectorGadget?

  • A point-and-click browser tool that helps you find a CSS selector matching the elements you click on

Example 3: Non-tabular data

(screenshot of page with non-tabular data, possibly CSS selector rules)

How can I use SelectorGadget?

(record clicks? also show code where it goes)
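
As a sketch of where the selectors go in code: whatever SelectorGadget suggests is passed as a string to `html_elements()` or `html_element()`. The page, the class names (`.book`, `.title`, `.price`), and the links below are all hypothetical, and {rvest} is assumed.

```r
library(rvest)

# Hypothetical page with repeated, non-tabular items.
page <- minimal_html('
  <div class="book">
    <h2 class="title"><a href="/books/1">First Book</a></h2>
    <span class="price">$10</span>
  </div>
  <div class="book">
    <h2 class="title"><a href="/books/2">Second Book</a></h2>
    <span class="price">$12</span>
  </div>
')

# One node per item; the selector string is what SelectorGadget helps you find.
books <- html_elements(page, ".book")

# html_element() (singular) returns exactly one result per item,
# so the columns stay aligned even if a piece is missing.
book_data <- data.frame(
  title = books |> html_element(".title a") |> html_text2(),
  link  = books |> html_element(".title a") |> html_attr("href"),
  price = books |> html_element(".price") |> html_text2()
)

book_data
```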

What are CSS selectors?

  • (I continue to go back and forth about XPath vs. CSS selectors. XPath can traverse back up the tree, which I feel might be vital for some advanced examples; but then again, maybe CSS :has() will be enough.)
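
To make the "back up the tree" point concrete, here is a small XPath sketch, assuming {rvest} and a hypothetical page. It finds the titles of only those items that contain a sale badge by stepping from the badge up to its parent and back down to the heading. Whether a CSS `:has()` selector would work here depends on the selector engine in use, which this sketch does not test.

```r
library(rvest)

# Hypothetical page: only some books carry a "sale" badge.
page <- minimal_html('
  <div class="book"><h2>First Book</h2><span class="sale">Sale</span></div>
  <div class="book"><h2>Second Book</h2></div>
  <div class="book"><h2>Third Book</h2><span class="sale">Sale</span></div>
')

# XPath: find each sale badge, step up to its parent div (..),
# then back down to that div's heading.
on_sale <- page |>
  html_elements(xpath = "//span[@class='sale']/../h2") |>
  html_text2()

on_sale
```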

Meeting Videos

Cohort 1
