More on I/O with arrow (1)

open_dataset(): makes a connection to (larger-than-memory) file(s) for lazy querying

  • returns an Arrow Dataset object in R
  • only reads data into memory when collect() is called, much like the dbplyr package
  • also works on large CSV files, although more slowly than on binary formats such as Parquet
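
The lazy behaviour described above can be sketched as follows. This is a minimal illustration, not a benchmark: the temporary CSV built from mtcars and the names csv_path, ds, query and result are assumptions chosen for the example.

```r
library(arrow)
library(dplyr)

# Hypothetical example data: write a small CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Open a connection to the file; no rows are loaded into R yet
ds <- open_dataset(csv_path, format = "csv")

# dplyr verbs only build up a query against the Dataset
query <- ds |>
  filter(cyl == 4) |>
  select(mpg, cyl, wt)

# collect() runs the query and returns the result as a tibble
result <- collect(query)
```

Until collect() is called, query is only a description of the work to do, so it is cheap to build and refine even when the underlying files are much larger than memory.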

https://r4ds.hadley.nz/arrow.html

If you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.
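
One way to try the "convert it to parquet" route is sketched below, assuming the arrow package's write_dataset() is used for the conversion. The temporary paths and the mtcars example data are placeholders for your own files.

```r
library(arrow)
library(dplyr)

# Hypothetical input: a small CSV standing in for your own data
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Hypothetical output directory for the Parquet files
pq_dir <- tempfile()

# Stream the CSV into Parquet without loading it all into R first
open_dataset(csv_path, format = "csv") |>
  write_dataset(path = pq_dir, format = "parquet")

# Parquet is the default format, so later sessions can simply reopen the directory
pq <- open_dataset(pq_dir)
heavy <- pq |>
  filter(wt > 3) |>
  collect()
```

After the one-off conversion, later queries read the Parquet files directly and never touch the CSV again.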