Code promotion and integration

Learning objectives

  • Describe the three environments used in software development and data science.
  • Use GitHub Actions (GHA) to deploy data science assets.
  • Keep environments in sync using infrastructure as code (IaC) tooling.

Why Do We Care About Code Promotion Workflows?

(Image Source: Reddit)
  • Without foresight, live products can break. Users sad; supervisors mad.
  • Code promotion reduces the risk of disaster by:
    • Modularizing processes
    • Testing & checking rigorously
    • Minimizing downtime

What does code promotion look like?

The Three Environments

(Image Source: Miami University)

Dev

  • The development environment is the product sandbox
  • Most “data science” happens here:
    • Data analysis & modeling
    • App prototyping
    • ETL

Comparing dev for data science vs dev for software engineering:

  • Goal
    • Data science: Explore relationships in data that may develop into live products
    • Software engineering: Build & implement a specific feature for a live product with pre-defined requirements
  • Tools
    • Data science: A “fully fledged” data science IDE (RStudio, VSCode) can encompass Dev, Test, & Prod
    • Software engineering: Dev, Test, & Prod are differentiated by environments & containers

Most of what data scientists do doesn’t end up as a live product in the state it was created; data scientists and software engineers think differently! (Source)

Test

  • Test is for testing :)

  • Tests are run for many reasons, including security, portability, performance, and usability

Prod

  • Gold standard (where your live product is released into the wild)
  • Should be guarded by Continuous Integration/Continuous Deployment (CI/CD)
  • Ideally zero manual interaction and zero changes to the actual code

CI/CD

  • Most CI/CD processes are managed with git
  • There are a handful of git workflow patterns for DevOps (Image Source: The book!)

CI/CD processes are “triggered” by git changes (i.e. when code changes in test)

  • GitHub Actions (GHA) is one of the most widely used CI/CD services

How Does it Work?

You write code that tells the CI/CD tool to:

  1. Build a clean, empty server in the cloud
  2. Copy your code with the new changes, plus the bare-minimum requirements for it to run
  3. Install and run any tests as specified; if any test fails, stop immediately and inform the developer
  4. If all tests pass, accept the new changes and “push” to production (automatically copy the changes to the live product)
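
The four steps above can be sketched as a tiny fail-fast runner. This is a minimal illustration in Python, not how GHA is implemented: the step names and the `run_pipeline` helper are hypothetical, and each step is a stand-in function that simply returns success or failure.

```python
from typing import Callable, List, Tuple

# Each step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure and report what ran."""
    log: List[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Later steps (including deployment) never run.
            log.append("stopped: notify developer, production untouched")
            return False, log
    return True, log


# Hypothetical stand-ins for the four steps. A real CI/CD runner would start
# a fresh machine, check out the repository, run the test suite, and deploy.
passing = [
    ("provision clean server", lambda: True),
    ("copy code and requirements", lambda: True),
    ("run tests", lambda: True),
    ("deploy to production", lambda: True),
]
failing = [
    ("provision clean server", lambda: True),
    ("copy code and requirements", lambda: True),
    ("run tests", lambda: False),
    ("deploy to production", lambda: True),
]

ok_pass, log_pass = run_pipeline(passing)
ok_fail, log_fail = run_pipeline(failing)
```

The point of the design is the early return: once a test fails, the deploy step is never reached, so a broken change cannot land in prod.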

r-lib 📦 is your friend for getting started with GHA in R (see here)

Per-Environment Configuration

  • When servers are “stood up” by CI/CD, they can take many forms
  • It’s wise to test these forms for your users
  • How do you flexibly code CI/CD to test many forms? config 📦 in R is your friend: use it to define per-environment settings, with the active configuration selected by an environment variable
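
The idea of per-environment configuration can be sketched in a few lines. This is a Python analogy only, not the API of the R config 📦: the `CONFIGS` dictionary, the host names, and the `APP_ENV` variable name are all made up for illustration.

```python
import os

# Hypothetical settings, analogous in spirit to the sections of a config file.
CONFIGS = {
    "default": {"db_host": "localhost", "debug": True},
    "test": {"db_host": "test-db.internal", "debug": True},
    "prod": {"db_host": "prod-db.internal", "debug": False},
}


def get_config(env=None):
    """Return the default settings overlaid with the active environment's values."""
    # APP_ENV is an assumed variable name, set by whoever stands up the server.
    active = env or os.environ.get("APP_ENV", "default")
    if active not in CONFIGS:
        raise KeyError(f"unknown environment: {active}")
    return {**CONFIGS["default"], **CONFIGS[active]}


prod_settings = get_config("prod")
test_settings = get_config("test")
```

The code itself never changes between environments; only the value of one environment variable does, which is what lets the same commit move from test to prod untouched.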

Creating & Maintaining Identical Environments

  • Servers should be cattle, not pets; environments are pocket change
    • Servers should be unremarkable, used frequently and interchangeably
  • Test environments should be identical to production; therefore, never fiddle with test when tests fail
    • Doing so causes servers to drift out of alignment
  • Infrastructure as Code (IaC) tools are meant to manage these servers and changes
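
To make “drift” concrete, here is a minimal sketch of a drift check. It is not a real IaC tool: it only compares two hypothetical package manifests, and the `find_drift` helper and the package versions shown are illustrative assumptions.

```python
def find_drift(reference, candidate):
    """Return {package: (reference_version, candidate_version)} for mismatches."""
    drift = {}
    for pkg in sorted(set(reference) | set(candidate)):
        ref_v = reference.get(pkg)    # None if the package is missing
        cand_v = candidate.get(pkg)
        if ref_v != cand_v:
            drift[pkg] = (ref_v, cand_v)
    return drift


# Illustrative manifests: someone "fiddled" with test by hand.
prod = {"pandas": "2.2.1", "numpy": "1.26.4"}
test = {"pandas": "2.2.1", "numpy": "1.26.3", "ipdb": "0.13.13"}

drift = find_drift(prod, test)
```

An empty result means the two environments agree. IaC tooling goes further by rebuilding environments from a single declared definition, so there is nothing to hand-edit in the first place.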

Review

Learning Objective: Describe the three environments used in software development and data science.
What Did We Learn?
  1. Dev: where products are conceived and built
  2. Test: where products are rigorously tested against the many different environments they will encounter “in the wild”
  3. Prod: where the live product exists “in the wild”

Learning Objective: Use GitHub Actions (GHA) to deploy data science assets.
What Did We Learn?
  • R users can do this with r-lib 📦

Learning Objective: Keep environments in sync using infrastructure as code (IaC) tooling.
What Did We Learn?
  • Servers are cheap; use them frequently and interchangeably
  • IaC tools help you avoid drift

Meeting Videos

Cohort 1

Meeting chat log
00:44:58    priyanka gagneja:   fantastic question
00:47:25    priyanka gagneja:   and dates