22.18 dplyr and arrow (7)

Performance (3)

  • As shown earlier, data manipulation with parquet files takes less than a second, versus more than 10 seconds when the entire CSV has to be read in
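  • The speed is due to partitioning and to storing data in a binary format, which can be loaded into memory more directly than text that has to be parsed
  • i.e., Arrow only needs to read the parquet file with 2021 data since the dataset is partitioned by year, and because parquet stores data column-wise it only reads the columns used in the query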
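
A minimal sketch of the two points above, using a small made-up data frame rather than the real checkouts data. The column names (`CheckoutYear`, `CheckoutMonth`, `MaterialType`, `Checkouts`) and the temporary path are assumptions for illustration only. It writes one parquet file per year, then runs a query that filters on the partition column and touches only some of the columns, so arrow should be able to skip the 2020 file entirely.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the much larger checkouts dataset
checkouts <- data.frame(
  CheckoutYear  = rep(c(2020L, 2021L), each = 3),
  CheckoutMonth = c(1L, 2L, 3L, 1L, 2L, 3L),
  MaterialType  = c("BOOK", "EBOOK", "BOOK", "BOOK", "BOOK", "EBOOK"),
  Checkouts     = c(5L, 3L, 2L, 4L, 6L, 1L)
)

# Write one parquet file per year (a fresh temporary directory, assumed path)
pq_path <- tempfile("checkouts_parquet")
write_dataset(
  checkouts,
  path = pq_path,
  format = "parquet",
  partitioning = "CheckoutYear"
)

# One sub-directory per partition value, each holding a parquet file
partition_files <- list.files(pq_path, recursive = TRUE)

# The query is lazy until collect(): the filter on CheckoutYear lets arrow
# read only the 2021 partition, and only the columns used are loaded
result <- open_dataset(pq_path) |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutMonth) |>
  collect()

print(partition_files)
print(result)
```

On data this small the timing difference will not be visible; the sketch only shows the directory layout and the query shape that make the speed-up possible on the full dataset.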