Choosing the Right Server for You

Learning objectives

  • Appreciate the hardware and software that underlie computation
  • CPU vs. GPU
  • RAM vs. Storage

What is Computing?

  • Everything on a computer is represented by a (usually very large) number.

  • At the hardware level, the only thing computers do is add these numbers together (a tiny demonstration follows this list).

  • Modern computers add very quickly and very accurately.

  • Computing speed is limited by the number of cores and the clock speed
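
To make the first two bullets concrete, here is a tiny sketch (standard library only; the specific string and float are just illustrative) showing that text and decimals are stored as numbers:

```python
# Everything on a computer is "just numbers" under the hood.
text = "Hi!"
print([ord(ch) for ch in text])     # character code points: [72, 105, 33]
print(text.encode("utf-8").hex())   # the same string as raw bytes: '486921'
print((0.1).hex())                  # 0.1 stored as a binary float: '0x1.999999999999ap-4'
```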

(Image: the classroom as a computer)

CPU: The physical brain

  • Cores: Available processing centers in a CPU (usually 4-32)

  • Clock speed: the number of operations each core can complete per second (2-5 GHz, i.e. 2-5 billion operations per second)

  • Recent performance gains have come mostly from adding cores and using them better, not from faster clocks (the sketch below checks both on your own machine)
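
A quick way to inspect your own machine's core count and clock speed (a sketch; `psutil` is a third-party package you may need to install with `pip install psutil`):

```python
import os
import psutil

print(f"Logical cores: {os.cpu_count()}")
freq = psutil.cpu_freq()            # can be None on some platforms
if freq and freq.max:
    print(f"Max clock:     {freq.max / 1000:.1f} GHz")
```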

(Image: Moore's law)

How do I go faster…?

  • For R/Python, fewer, faster cores are usually preferable to many slower ones

  • For servers, keep it light:

\(n \textrm{ cores} = 1 \textrm{ core per user} + 1\)
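
As a worked example (hypothetical numbers): a server shared by 4 analysts would call for \(4 + 1 = 5\) cores, so an 8-core instance leaves comfortable headroom.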

GPU

  • Specialized chips initially built for graphics

  • Many more, but much slower, cores than a CPU (see the sketch after this list)

Where a consumer-grade CPU has 4-16 cores, mid-range GPUs have 700-4,000 cores, but each one runs at between 1% and 10% of the speed of a CPU core.

  • Great for parallelisation (think Neural Nets)
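
To see where your own machine sits on that spectrum, here is a minimal sketch (assuming PyTorch is installed; the attribute names are PyTorch's, not the chapter's, and any framework with a device query would do):

```python
# Compare CPU core count with what the GPU exposes.
import os
import torch

print(f"CPU logical cores: {os.cpu_count()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Each streaming multiprocessor contains many individual CUDA cores.
    print(f"GPU: {props.name} with {props.multi_processor_count} streaming multiprocessors")
else:
    print("No CUDA GPU visible to PyTorch")
```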

So Should I Get a GPU…?

  • Usually, no (parallelising across < 10 CPU workers is often enough)

  • You’ll know you need a GPU when you get there

  • Discuss…

(Image: decisions, decisions)

RAM

  • Basically short-term memory: when a process ends, everything it held in RAM is destroyed (e.g. when you quit RStudio)

  • Get a computer with lots of RAM, probably

\(\textrm{Necessary RAM} = \textrm{max amount of data} \times 3\)

  • If it’s too big, move it to disk (e.g. a database, Arrow, dask, HDF5); the sketch below sanity-checks the rule above
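
A minimal sanity check of the "3x your data" rule of thumb (a sketch; `data.csv` is a hypothetical file and `psutil` is a third-party package):

```python
# Compare the 3x rule of thumb against the RAM actually installed.
import os
import psutil

data_bytes = os.path.getsize("data.csv")        # hypothetical dataset on disk
needed = data_bytes * 3                          # Necessary RAM = max data x 3
installed = psutil.virtual_memory().total

print(f"Data: {data_bytes / 1e9:.1f} GB -> rule-of-thumb RAM: {needed / 1e9:.1f} GB")
if needed > installed:
    print("Too big for memory: consider a database, Arrow, dask, or HDF5")
```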

Big data in R

Disk (Storage)

  • Where physical files live and persist

  • The bigger the better (and cheaper nowadays)

\(\textrm{Necessary Storage} = \textrm{approx data} + 1 \textrm{GB} \times n \textrm{ users}\)
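
As a worked example (hypothetical numbers): roughly 50 GB of data shared by 5 users gives \(50\,\textrm{GB} + 1\,\textrm{GB} \times 5 = 55\,\textrm{GB}\), so a 100 GB volume leaves room to grow.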

Scenarios

  1. You try to load a big CSV file into pandas in Python. It churns for a while and then crashes.

  2. You go to build a new ML model on your data. You’d like to re-train the model once a day, but it turns out training this model takes 26 hours on your laptop.

  3. You design a visualization in Matplotlib and create a whole bunch of them in a loop; you want to parallelize the operation (one way is sketched after this list). Right now you’re running on a t2.small with 1 CPU.
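
For scenario 3, a minimal sketch of the parallel plotting loop (the filenames and figure contents are made up; assumes Matplotlib and NumPy are installed):

```python
# Generate many Matplotlib figures in parallel with a process pool.
import matplotlib
matplotlib.use("Agg")                 # non-interactive backend, safe in worker processes
import matplotlib.pyplot as plt
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def make_plot(i):
    """Draw one (made-up) figure, save it to disk, and return the path."""
    fig, ax = plt.subplots()
    x = np.linspace(0, 10, 200)
    ax.plot(x, np.sin(x + i))
    out = f"plot_{i:03d}.png"
    fig.savefig(out)
    plt.close(fig)                    # release the figure's memory
    return out

if __name__ == "__main__":
    # On a t2.small (1 CPU) the pool buys nothing; on a bigger instance
    # the loop finishes roughly n_cores times faster.
    with ProcessPoolExecutor(max_workers=4) as pool:
        paths = list(pool.map(make_plot, range(20)))
    print(f"wrote {len(paths)} figures")
```

Moving this same loop to an instance with more cores is exactly the kind of sizing decision the chapter is about.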

Lab

Not much to say here…🤷

Meeting Videos

Cohort 1