More on I/O with arrow (2)

Partitioning a large CSV file into a Parquet dataset consisting of multiple files:

seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")

This can result in a massive performance difference: queries on the partitioned Parquet dataset can skip irrelevant partitions and read only the columns they need, instead of scanning the whole CSV file.
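A minimal sketch of reading the partitioned dataset back and querying it lazily (assumes the pq_path from above and that arrow and dplyr are loaded; the column names come from the Seattle library checkouts data used earlier):

```r
library(arrow)
library(dplyr)

# Open the partitioned Parquet dataset; nothing is read into memory yet
seattle_pq <- open_dataset(pq_path)

seattle_pq |>
  filter(CheckoutYear == 2021) |>      # only the 2021 partition is scanned
  summarise(total = sum(Checkouts)) |>
  collect()                            # materialise the result as a tibble
```

Because the data is partitioned by CheckoutYear, the filter() is pushed down to the file level and only the matching partition's files are touched.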

There’s also arrow::to_duckdb(), which registers an Arrow dataset as a DuckDB table (without copying the data), so the query is executed by DuckDB’s engine.
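A hedged sketch of handing the pipeline over to DuckDB (again assuming the pq_path dataset from above):

```r
library(arrow)
library(dplyr)

seattle_pq <- open_dataset(pq_path)

seattle_pq |>
  to_duckdb() |>                  # expose the Arrow dataset to DuckDB, zero-copy
  group_by(CheckoutYear) |>
  summarise(checkouts = sum(Checkouts)) |>
  collect()                       # pull the result back into R
```

This is useful when a query needs operations DuckDB supports but the arrow dplyr backend does not, such as window functions.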

arrow also provides read_csv_arrow(), read_tsv_arrow(), read_delim_arrow(), and read_json_arrow() for reading single files with Arrow’s multithreaded parsers.
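A minimal sketch of one of these readers (the file name "checkouts.csv" is hypothetical):

```r
library(arrow)

# By default read_csv_arrow() returns a tibble;
# as_data_frame = FALSE keeps the result as an Arrow Table instead,
# which avoids converting large data into R memory up front
tbl <- read_csv_arrow("checkouts.csv", as_data_frame = FALSE)
```

The Arrow Table can then be queried with dplyr verbs and collect()ed only when needed, just like a dataset opened with open_dataset().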