Open the data

  • Size in memory ≈ 2 × size on disk
  • read_csv() ➡️ arrow::open_dataset()
  • Scans a few thousand rows to determine dataset structure
    • ISBN is empty for 80k rows, so the scan cannot infer its type; we specify it as a string
  • Does NOT load entire dataset into memory
library(arrow)

# Open the CSV lazily; force ISBN to string instead of relying on the type scan
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)
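
To see the lazy behaviour on something small, here is a minimal sketch that uses a hypothetical three-row toy file instead of the real checkouts CSV. The column names `CheckoutYear` and `Checkouts`, the temp file, and the `toy_ds` / `result` names are assumptions made for illustration only. It assumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the real file; ISBN is mostly empty
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  CheckoutYear = c(2020L, 2021L, 2022L),
  Checkouts    = c(3L, 5L, 2L),
  ISBN         = c(NA, NA, "9780306406157")
)
write.csv(toy, tmp, row.names = FALSE, na = "")

# Same call shape as on the slide: ISBN is declared as a string up front
toy_ds <- open_dataset(
  sources = tmp,
  col_types = schema(ISBN = string()),
  format = "csv"
)

# Inspecting the schema does not read the rows into R
toy_ds$schema

# dplyr verbs only build a query; collect() runs it and returns a tibble
result <- toy_ds |>
  filter(CheckoutYear >= 2021) |>
  summarise(total = sum(Checkouts)) |>
  collect()

result
```

Nothing is pulled into R until `collect()` is called, which is why the same pattern can work on a file larger than available memory.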