12.7 Performance Comparison

12.7.1 Computational speed

library(tidyverse)

# Load the pre-computed benchmark results (a bench::mark() summary)
res <- read_rds('./data/res.rds')

res |> 
  select(1:8) |>   # keep the first eight benchmark columns
  knitr::kable()
expression               min (s)   median (s)    itr/sec   mem_alloc (B)   gc/sec   n_itr   n_gc
summary_tbl               0.0000     1.00e-09   8.90e+08               0     0.00   10000      0
collect(summary_duckdb)   0.0258     2.70e-02   3.69e+01          516704     4.34      17      2
collect(summary_arrow)    0.1258     1.37e-01   7.44e+00          123880     2.48       3      1
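A result like this typically comes from bench::mark(). Below is a minimal sketch of how res might have been generated, assuming summary_tbl holds an already-materialized in-memory summary while summary_duckdb and summary_arrow are lazy queries against the DuckDB and Arrow backends:

library(bench)
library(dplyr)

# Hypothetical reconstruction: each expression retrieves the same summary
# from a different backend. collect() forces the lazy DuckDB/Arrow queries
# to run, while summary_tbl is just an object lookup.
res <- bench::mark(
  summary_tbl,
  collect(summary_duckdb),
  collect(summary_arrow),
  check = FALSE  # the backends return results of different classes
)

Note how summary_tbl appears essentially free: evaluating it only looks up an object that already exists in memory, whereas collect() does the actual computation for the other two backends.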

12.7.2 Memory footprint

#        tbl      arrow     duckdb 
# 2004855136        504      51352 
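The contrast here is the point: the in-memory tibble occupies roughly 2 GB of RAM, while the arrow and duckdb objects take only a few hundred bytes and about 50 KB respectively, because they are merely handles to data held outside of R. A sketch of how such a comparison might be produced, assuming the three objects are named tbl, arrow, and duckdb:

library(lobstr)

# Report each object's size in bytes (object names assumed).
# obj_size() measures the R object itself; for the arrow and duckdb
# objects that is just the reference, not the underlying data.
sapply(
  list(tbl = tbl, arrow = arrow, duckdb = duckdb),
  \(x) as.numeric(obj_size(x))
)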

12.7.3 Disk storage footprint

# A tibble: 3 × 2
#   format    footprint
#   <chr>   <fs::bytes>
# 1 duckdb        1.95G
# 2 parquet     350.46M
# 3 rds         211.92M
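The rds file is the smallest because saveRDS() gzip-compresses by default, Parquet’s columnar compression lands in between, and the DuckDB database file is the largest since it stores its own internal structures alongside the data. A sketch of how these footprints might be measured with the fs package; the file paths are assumptions for illustration:

library(fs)
library(tibble)
library(dplyr)

# Measure each format's size on disk (paths are hypothetical)
tibble(
  format    = c("duckdb", "parquet", "rds"),
  footprint = fs_bytes(c(
    file_size("./data/data.duckdb"),       # single database file
    sum(dir_info("./data/parquet")$size),  # a Parquet dataset may span multiple files
    file_size("./data/data.rds")
  ))
) |> 
  arrange(desc(footprint))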

12.7.4 Overall guidelines

  • If your data is small (i.e., less than a couple hundred megabytes), just use CSV because it’s easy, cross-platform, and versatile.
  • If your data is larger than a couple hundred megabytes and you’re just working in R (either by yourself or with a few colleagues), use .rds because it’s space-efficient and optimized for R.
  • If your data is around a gigabyte or more, you need to share it across platforms (i.e., not just R but also Python, etc.), and you don’t want to run a SQL-based RDBMS, store it in the Parquet format and read it with the arrow package.
  • If you want to work in SQL with a local data store, use DuckDB: it offers more features and better performance than RSQLite, and it doesn’t require a client-server architecture that can be cumbersome to set up and maintain (see the connection sketch after these guidelines).
  • If you have access to an RDBMS server (hopefully maintained by a professional database administrator), use the appropriate DBI interface (e.g., RMariaDB, RPostgreSQL, etc.) to connect to it.
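To make the last two bullets concrete, here is a minimal connection sketch; the database path and table name are hypothetical:

library(DBI)
library(duckdb)

# Embedded DuckDB: no server to run, the database is just a local file
con <- dbConnect(duckdb::duckdb(), dbdir = "./data/data.duckdb")
dbGetQuery(con, "SELECT COUNT(*) AS n FROM my_table")  # `my_table` is hypothetical
dbDisconnect(con, shutdown = TRUE)

# With a managed server, only the driver and connection details change, e.g.:
# con <- dbConnect(RPostgreSQL::PostgreSQL(), host = "...", dbname = "...", user = "...")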