Data Project Architecture

Learning objectives

Poorly architected software is likely to break when you share it or take it to production. A standard software architecture for apps:

Figure: Three-layer app
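A minimal sketch of how a three-layer split might look in code. The layer names (data, processing, presentation), the function names, and the hard-coded records are illustrative assumptions, not taken from the figure:

```python
from statistics import mean

# Data layer (assumed name): the only code that knows where the data lives.
def load_scores():
    # Hypothetical hard-coded records standing in for a real data source.
    return [
        {"team": "a", "score": 3},
        {"team": "a", "score": 5},
        {"team": "b", "score": 4},
    ]

# Processing layer (assumed name): pure logic, no I/O and no formatting.
def mean_score_by_team(records):
    grouped = {}
    for record in records:
        grouped.setdefault(record["team"], []).append(record["score"])
    return {team: mean(scores) for team, scores in grouped.items()}

# Presentation layer: turns results into what the user actually sees.
def render_report(summary):
    lines = [f"{team}: {value:.1f}" for team, value in sorted(summary.items())]
    return "\n".join(lines)

report = render_report(mean_score_by_team(load_scores()))
print(report)
```

Because each layer only talks to the one next to it, the data source or the output format can be swapped without touching the processing logic.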

The three layers can help clarify parts of our projects, but:

We might not be designing an app
We usually have projects with many components
Apps run in response to user actions; data science projects run in response to updates to the data
We usually don’t have ownership of the data

The presentation layer is what our users will consume, so we have to choose it first. Options:

Figure: Decision workflow for the presentation layer

Don’t pull all the data into your session. Instead:

Push work to the data source: do anything you can do before you pull the data out.
Be lazy with data pulls: pull the data that’s needed when it’s needed.
Sample the data. This makes sense for machine learning tasks but not for counting.
Chunk and pull. Identify natural groups in the data and pull one group at a time.
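Two of the strategies above can be sketched with Python's built-in sqlite3 module. The in-memory `sales` table, its columns, and the function names are hypothetical stand-ins for a real data source:

```python
import sqlite3

# Hypothetical in-memory table standing in for a remote data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5), ("east", 2.5), ("south", 1.0)],
)

def totals_by_region(conn):
    """Push work to the source: the database aggregates, we pull only the summary."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return dict(conn.execute(query).fetchall())

def pull_in_chunks(conn):
    """Chunk and pull: yield one natural group (here, a region) at a time."""
    regions = [
        row[0]
        for row in conn.execute("SELECT DISTINCT region FROM sales ORDER BY region")
    ]
    for region in regions:
        chunk = conn.execute(
            "SELECT amount FROM sales WHERE region = ?", (region,)
        ).fetchall()
        yield region, [amount for (amount,) in chunk]

totals = totals_by_region(conn)
chunk_sizes = {region: len(amounts) for region, amounts in pull_in_chunks(conn)}
```

Since `pull_in_chunks` is a generator, each chunk is fetched only when the caller asks for it, which also keeps the pulls lazy.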

Where to store the data depends on how often it is updated. Store it in the presentation bundle only if the data and the app will be updated together.

Filesystem: hard for deployment.
Blob storage or pins: cloud storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. They usually have client packages.
Google Sheets: maybe as an intermediate step.

You can also store your intermediate artifacts in .csv, pickle, or .rds files, or use DuckDB.
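As a rough illustration of the file-based options, here is a sketch that saves and reloads a small table as CSV and an arbitrary Python object as a pickle, using only the standard library. The file names, the helper functions, and the sample data are assumptions made for the example:

```python
import csv
import pickle
import tempfile
from pathlib import Path

def save_artifacts(rows, model, directory):
    """Write a table to CSV and an arbitrary Python object to a pickle file."""
    directory = Path(directory)
    csv_path = directory / "cleaned.csv"  # hypothetical file name
    pkl_path = directory / "model.pkl"    # hypothetical file name
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    with pkl_path.open("wb") as f:
        pickle.dump(model, f)
    return csv_path, pkl_path

def load_artifacts(csv_path, pkl_path):
    """Read both artifacts back; CSV values always come back as strings."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    with pkl_path.open("rb") as f:
        model = pickle.load(f)
    return rows, model

rows = [{"id": "1", "score": "0.5"}, {"id": "2", "score": "0.9"}]
model = {"coef": [1.0, 2.0]}  # stand-in for a fitted model object

with tempfile.TemporaryDirectory() as tmp:
    paths = save_artifacts(rows, model, tmp)
    loaded_rows, loaded_model = load_artifacts(*paths)
```

CSV is readable from any language but loses column types, while pickle keeps Python objects intact but should only be loaded from sources you trust.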

Figure: Data flow chart

Comprehension questions

Do you ever think about your application layer?
What libraries could you use to implement a three-layer architecture in R or Python?
How do you reduce the data requirements for your project?
How do you handle your intermediate artifacts?