Docker for Data Science

Learning objectives

  • Decide whether a container is the right tool for a given job.
  • Download and run pre-built Docker images.
  • Describe the stages of the Docker container lifecycle.
  • Build simple Dockerfiles for your own projects.

Why containers?

  • Containers are a way to save an entire machine’s state, rather than just project components
    • This extends beyond packages and libraries to the R or Python version itself, as well as any other tools
  • While containers are similar to VMs, they are much more single-purpose. They…
    • are quick to start
    • can be used for individual projects or scripts
  • The container’s configuration is code (infrastructure as code!) and so it’s easy to reproduce

https://www.reddit.com/r/ProgrammerHumor/comments/cw58z7/it_works_on_my_machine/

Why containers for Data Science?

  • Containers are mostly used for:
    1. Packaging an environment for someone else to use
    2. Packaging a finished product (project/app/whatever) for archiving, reproducibility, or production

Potential examples:

  • When publishing, instead of providing code, data, and describing the environment used, you can include a Dockerfile so anyone can pick up exactly where you left off
  • You have a project at work that needs to be interacted with every week no matter who looks at it or when
  • You’re publishing an R package and want to test specific features across different base R or package versions

Words of caution

  • Docker is limited to the resources you give it. If your dev machine is less than awesome, your Docker container will be much less than awesome
  • Docker is only allowed access to what you give it and may take some extra work to get running
  • Some workplaces may not be comfortable with Docker
  • Some use-cases may require direct access to the hardware and are incompatible with a container system
    • Sometimes computers do math differently (e.g., floating-point results can vary across CPU architectures)
  • Containers require proper setup!

Diving In

Docker Run

docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

  • docker run tells Docker to run the following image
  • Options are configured as needed
  • IMAGE takes the form <user>/<image> when pulling from Docker Hub (like CRAN, but for Docker)
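As a concrete example, the public rocker/rstudio image from Docker Hub serves RStudio Server on port 8787 (requires Docker to be installed and running; the password is whatever you choose):

```shell
# Pull (if needed) and run the rocker/rstudio image from Docker Hub
docker run --rm -p 8787:8787 -e PASSWORD=yourpassword rocker/rstudio
# Then open http://localhost:8787 in a browser (username: rstudio)
```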

Docker Compose

docker-compose.yml -> docker compose [-f <arg>...] [options] [COMMAND] [ARGS...]

  • The docker-compose file provides a structured way to describe one or more containers
  • Easy way to combine multiple services (maybe you want R + Python)
  • Can be used with the run (do something) or up command (be ready to do something)
  • Options are similar to docker run
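A minimal docker-compose.yml might look like this (image name, ports, and paths are illustrative):

```yaml
services:
  rstudio:
    image: rocker/rstudio
    ports:
      - "8787:8787"
    environment:
      - PASSWORD=yourpassword
    volumes:
      - .:/home/rstudio/project
```

Then `docker compose up` starts everything described in the file, and `docker compose down` tears it back down.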

Container Lifecycle

  • Change the Dockerfile, not the image
  • Images can be shared like code
    • Think Git!
  • There are services to provide private image registries for companies
  • Docker will usually pull an image automatically if it doesn’t already exist locally
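The share-images-like-code workflow can be sketched as follows (the username and image name are placeholders):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t myuser/myimage:1.0 .

# Share it via a registry (Docker Hub by default)
docker push myuser/myimage:1.0

# Others pull it down, much like cloning a repo
docker pull myuser/myimage:1.0
```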

More on Docker Run

  • Docker Hub images are in the form <user>/<name> (alexkgold/plumber)
    • You can tag an image with a version number: <user>/<name>:<version>
  • --rm to remove the container on exit (probably not for production)
  • -d run in detached mode so the terminal is free for other uses
  • -p <host>:<container> publishes a port from inside the container to outside
  • --name to assign a name of your choice
  • -v <outside/directory>:<inside/directory> to expose a directory
    • ${PWD} expands to your current working directory
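Putting those options together in one command (the container name and mount paths are illustrative):

```shell
# Run detached, expose RStudio's port, name the container,
# mount the current directory, and clean up on exit
docker run --rm -d \
  -p 8787:8787 \
  --name my-analysis \
  -v ${PWD}:/home/rstudio/project \
  -e PASSWORD=yourpassword \
  rocker/rstudio
```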

Build a Dockerfile

  • Not the same as a docker-compose.yml
  • FROM sets the base image for the container
  • RUN runs any command as though it’s using the terminal
    • If using something fancy, you may need to install it first
  • COPY copies a file from host to container
  • CMD sets the command to run when the container starts
  • Images rebuild from the first instruction that changed; everything above it comes from the layer cache
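A minimal Dockerfile using those instructions might look like this (the R version, package, and file names are illustrative):

```dockerfile
# Base image: pin a specific R version for reproducibility
FROM rocker/r-ver:4.3.1

# Install packages as though at a terminal
RUN R -e 'install.packages("plumber")'

# Copy the API script from host into the container
COPY plumber.R /app/plumber.R

# Command to run when the container starts
CMD ["R", "-e", "plumber::pr_run(plumber::pr('/app/plumber.R'), host='0.0.0.0', port=8000)"]
```

Build it with `docker build -t myuser/my-api .` from the directory containing the Dockerfile.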

Trying out Docker

  1. Try out plumber penguins in your browser
  2. Kill it
  3. Do it again
  4. Poke around
  5. Kill it again
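Using the alexkgold/plumber image mentioned earlier, that cycle might look like this (assuming the API serves on plumber's default port, 8000):

```shell
# Run the API detached and visit http://localhost:8000 in your browser
docker run --rm -d -p 8000:8000 --name penguins alexkgold/plumber

docker ps              # poke around: list running containers
docker stop penguins   # kill it (--rm removes the container too)

# Do it again
docker run --rm -d -p 8000:8000 --name penguins alexkgold/plumber
docker stop penguins   # kill it again
```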

Meeting Videos

Cohort 1