12.7 Performance Comparison
12.7.1 Computational speed
| expression              | min (s) | median (s) | itr/sec  | mem_alloc (B) | gc/sec | n_itr | n_gc |
|-------------------------|---------|------------|----------|---------------|--------|-------|------|
| summary_tbl             | 0.0000  | 1.00e-09   | 8.90e+08 | 0             | 0.00   | 10000 | 0    |
| collect(summary_duckdb) | 0.0258  | 2.70e-02   | 3.69e+01 | 516704        | 4.34   | 17    | 2    |
| collect(summary_arrow)  | 0.1258  | 1.37e-01   | 7.44e+00 | 123880        | 2.48   | 3     | 1    |
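A timing table like this one could be produced with the bench package; the sketch below assumes the `summary_tbl`, `summary_duckdb`, and `summary_arrow` objects from earlier in the chapter (an in-memory tibble and two lazy queries, respectively) are already defined.

```r
library(bench)
library(dplyr)

# Sketch: benchmark evaluating each backend's summary.
# summary_tbl is an already-materialized in-memory object, so touching it
# is essentially free; the DuckDB and Arrow queries are lazy and only do
# real work when collect() forces execution.
bm <- bench::mark(
  summary_tbl,
  collect(summary_duckdb),
  collect(summary_arrow),
  check = FALSE  # backends may differ in row order or result class
)
bm
```

The near-zero timings for `summary_tbl` are therefore expected: they measure looking up an existing object, not recomputing the summary.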
12.7.4 Overall guidelines
- If your data is small (say, less than a couple hundred megabytes), just use CSV because it’s easy, cross-platform, and versatile.
- If your data is larger than a couple hundred megabytes and you’re just working in R (either by yourself or with a few colleagues), use .rds because it’s space-efficient and optimized for R.
- If your data is around a gigabyte or more and you need to share your data files across different platforms (i.e., not just R but also Python, etc.) and you don’t want to use a SQL-based RDBMS, store your data in the Parquet format and use the arrow package.
- If you want to work in SQL with a local data store, use DuckDB: it offers more features and better performance than RSQLite, and, being an embedded database, it avoids the client-server architecture that can be cumbersome to set up and maintain.
- If you have access to an RDBMS server (hopefully maintained by a professional database administrator), use the appropriate DBI interface (e.g., RMariaDB, RPostgreSQL, etc.) to connect to it.
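As a rough illustration of the Parquet and DuckDB recommendations above (the file paths, database name, and use of `mtcars` are illustrative, not from the chapter):

```r
library(arrow)
library(DBI)
library(duckdb)

# Parquet via the arrow package: compressed, columnar, and readable
# from Python, Julia, etc., not just R.
write_parquet(mtcars, "mtcars.parquet")
df <- read_parquet("mtcars.parquet")

# DuckDB via DBI: an embedded SQL engine, no server to administer.
con <- dbConnect(duckdb::duckdb(), dbdir = "cars.duckdb")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con, shutdown = TRUE)
```

The same DBI verbs (`dbConnect()`, `dbWriteTable()`, `dbGetQuery()`) carry over to a server-backed RDBMS: only the driver passed to `dbConnect()` (e.g., `RMariaDB::MariaDB()`) and its connection arguments change.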