Everything on a computer is represented by a (usually very large) number.
At a hardware level, just about the only thing computers do is simple arithmetic (mostly addition) on these numbers.
Modern computers add very quickly and very accurately.
Computing speed is limited by the number of cores and the clock speed
classroom as a computer
Cores: available processing units in a CPU (usually 4-32)
Clock speed: number of operations each core can perform per second (2-5 GHz, i.e., 2-5 billion operations per second)
Most recent improvements come from more cores and better core usage, not faster clocks
Moore’s law
For R/Python, fewer, faster cores are usually preferable to many slower cores
For servers, keep it light:
\(\textrm{Necessary cores} = n_{\textrm{users}} + 1\)
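A minimal sketch of this sizing rule in Python (the `n_users` value is a hypothetical input; `os.cpu_count()` reports what the machine actually has):

```python
import os

# Hypothetical number of concurrent users on the server
n_users = 8

# Sizing rule from above: one core per user, plus one spare
necessary_cores = n_users + 1

print(f"Cores available on this machine: {os.cpu_count()}")
print(f"Cores recommended for {n_users} users: {necessary_cores}")
```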
Specialized chips initially built for graphics
Many more, but slower, cores than a CPU
Where a consumer-grade CPU has 4-16 cores, mid-range GPUs have 700-4,000 cores, but each one runs between 1% and 10% the speed of a CPU core.
Usually, no (parallelizing across < 10 workers is often enough; see the sketch after this list)
You’ll know you need a GPU when you get there
Discuss…
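To make the "< 10 workers" point concrete, here is a minimal sketch of CPU-only parallelism using just Python's standard library; `slow_task` is a hypothetical stand-in for real work (a simulation, a model fit, etc.):

```python
from concurrent.futures import ProcessPoolExecutor

def slow_task(n):
    # Hypothetical stand-in for real, CPU-bound work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000] * 8
    # A handful of CPU workers is often all the parallelism you need
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(slow_task, inputs))
    print(results)
```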
decisions, decisions
Basically short-term memory. When a process ends, its memory is destroyed (like when you quit RStudio)
Get a computer with lots of RAM, probably
\(\textrm{Necessary RAM} = \textrm{max amount of data} \times 3\)
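One way to sanity-check this rule of thumb, assuming your data lives in a pandas DataFrame (the DataFrame below is a toy stand-in for your real data):

```python
import pandas as pd

# Toy stand-in for your real dataset
df = pd.DataFrame({"x": range(1_000_000), "y": [0.5] * 1_000_000})

# Actual in-memory footprint of the data, in GB
data_gb = df.memory_usage(deep=True).sum() / 1e9

# Rule of thumb from above: budget roughly 3x the data size in RAM
print(f"Data: {data_gb:.2f} GB -> budget ~{3 * data_gb:.2f} GB of RAM")
```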
Big data in R
Where physical files live and persist
The bigger the better (and storage is cheap nowadays)
\(\textrm{Necessary Storage} = \textrm{approx. data} + 1\,\textrm{GB} \times n_{\textrm{users}}\)
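And the storage rule as a worked example (both inputs are hypothetical):

```python
# Hypothetical inputs: ~50 GB of data shared by 10 users
approx_data_gb = 50
n_users = 10

# Rule of thumb from above: data size plus ~1 GB per user
necessary_storage_gb = approx_data_gb + 1 * n_users
print(f"Provision at least {necessary_storage_gb} GB of storage")
```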
You try to load a big CSV file into pandas in Python. It churns for a while and then crashes.
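One common workaround when a file is bigger than RAM is to stream it in chunks rather than load it all at once; a minimal sketch (the file name and column name are hypothetical):

```python
import pandas as pd

total = 0
# Read 100,000 rows at a time instead of the whole file at once
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    # Aggregate as you go, so only one chunk is in memory at a time
    total += chunk["value"].sum()

print(f"Sum of the 'value' column: {total}")
```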
You go to build a new ML model on your data. You’d like to re-train the model once a day, but it turns out training this model takes 26 hours on your laptop.
You design a visualization in Matplotlib and create a whole bunch of them in a loop; you want to parallelize the operation. Right now you’re running on a t2.small with 1 CPU.
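A minimal sketch of parallelizing that kind of plotting loop, assuming you have moved to a machine with more than one CPU (on a 1-CPU t2.small the workers would just queue up); the data and file names are hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe in worker processes
import matplotlib.pyplot as plt

def make_plot(i):
    # Each worker builds and saves one figure
    fig, ax = plt.subplots()
    ax.plot(range(100), [x * i for x in range(100)])
    fig.savefig(f"plot_{i}.png")
    plt.close(fig)  # free the figure's memory

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(make_plot, range(20)))
```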
Not much to say here…🤷