{renv} (or {venv} and uv) to control the package environment.
The 5 tenets of DevOps from the intro call:
r-lib and GitHub Actions
“Servers as cattle”
Before we dive in, who has experience with managing package environments in R and Python, and what do you guys think about each language’s ecosystem?
Which is simpler/more pleasant to use?
Have you tried in R without RStudio?
Today we will focus mainly on the first layer, the package environment.
The Data Scientist Life Cycle
“This works fine for a while. But the problem with this is that the default has you installing things into a cache that’s shared among every project on your system.”
It’s also considered rude to make changes to a user’s system-level package setup.
More importantly, DevOps aims to help you make things that don’t break. Installing packages without specifying versions does nothing to stop things from breaking in the future.
Most of us manage to fit everything we want in our homes, but it’s not all in the same room and we’re not bound to keep our rooms the same over time.
At any given time, we put things in different rooms for a variety of reasons; space and incompatible functions being the main two.
This is also true over time - perhaps even more so - we move rooms around, place things in and out of rooms according to our needs. It would be quite restrictive if we had to have everything, everywhere, all at once.
It would be more restrictive still if we had to keep incompatible things together. Kitchen and toilet together? No thanks.
You must choose
‘Why can’t we have multiple versions of the same package?’
Because no
When it comes to managing package environments, each package is stored in a /Library folder and we can only have one version of each package in /Library
We can’t have multiple versions of ‘package x’ in /Library, but we can have multiple versions of /Library, each with a different version of ‘package x’
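The "multiple versions of /Library" idea can be sketched in base R. This is only an illustration of the underlying mechanism, not a description of how {renv} is implemented; the folder names `project_a_library` and `project_b_library` are hypothetical and live in a temporary directory.

```r
# Two hypothetical project libraries (temporary folders, for illustration only)
lib_a <- file.path(tempdir(), "project_a_library")
lib_b <- file.path(tempdir(), "project_b_library")
dir.create(lib_a, showWarnings = FALSE)
dir.create(lib_b, showWarnings = FALSE)

# Remember the current library paths so they can be put back afterwards
original_paths <- .libPaths()

# "Enter" project A: its library is now searched (and installed into) first
.libPaths(c(lib_a, original_paths))
active_a <- .libPaths()[1]

# "Enter" project B: a different library takes first place
.libPaths(c(lib_b, original_paths))
active_b <- .libPaths()[1]

# Put the session back the way it was
.libPaths(original_paths)

print(basename(active_a))
print(basename(active_b))
```

Each library could hold its own version of ‘package x’, and whichever library is first in `.libPaths()` decides which version gets loaded.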
renv::init() to create a standalone library
renv::activate() (this happens automatically if using RStudio projects)
install.packages("dplyr")
renv::snapshot() to track versions etc.
“* Lockfile written to ‘~/git_repos/bookclub-do4ds/renv.lock’.”
You now have a file called renv.lock, which records the project’s packages and their versions:
renv lock
*See the book for the Python equivalent, i.e. virtualenv/venv and requirements.txt
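To make the role of the lockfile more concrete, here is a minimal sketch of comparing recorded versions against what is installed. It assumes only that each lockfile entry has `Package` and `Version` fields; the `locked` list is a hand-written stand-in rather than a real renv.lock, and `find_drift()` is a hypothetical helper. In a real project, `renv::status()` is the tool for this.

```r
# A hand-written stand-in for the "Packages" section of a lockfile.
# "stats" uses the version from the current session so it always matches;
# "utils" is given a deliberately wrong version to show what drift looks like.
locked <- list(
  stats = list(Package = "stats", Version = as.character(packageVersion("stats"))),
  utils = list(Package = "utils", Version = "0.0.1")
)

# Hypothetical helper: names of packages whose installed version
# differs from the version recorded in the lockfile-style list
find_drift <- function(records) {
  differs <- vapply(
    records,
    function(rec) as.character(packageVersion(rec$Package)) != rec$Version,
    logical(1)
  )
  names(records)[differs]
}

drifted <- find_drift(locked)
print(drifted)
```

Without recorded versions there is nothing to compare against, which is why an unpinned `install.packages()` call cannot tell you that something has changed underneath you.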
Now that you have your lock file, you can send your project over to a friend/colleague and they should:
setwd('<project-dir>')
e.g. if you made a projects folder in your home directory and a project within your projects folder named my_project: setwd("~/projects/my_project")
renv::init()
renv::restore() - if this didn’t happen automatically
“In the context of a production environment, Conda smashes together the language version, the package management, and (sometimes) the system library management. … I generally recommend people use a tool that’s just for package management, like {venv}, as opposed to an all-in-one tool like Conda.”
“In other cases, your R or Python library might basically just be a wrapper for system libraries. For example, many popular packages for geospatial analysis are just thin language wrappers that call out to the system libraries. In this case, it might be important to be able to maintain a particular version of the underlying system library to ensure that your code runs at all in the future.”
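As a small illustration of the system layer, base R can report the versions of some system libraries it was built against via `extSoftVersion()`. The sketch below simply records them next to the R version; the `environment_record` name is hypothetical, and this covers only the libraries R itself uses. Packages that wrap system libraries often provide their own equivalents (for example `sf::sf_extSoftVersion()`, if {sf} is installed).

```r
# Versions of the system libraries that this build of R relies on
system_versions <- extSoftVersion()
print(system_versions[c("zlib", "PCRE")])

# Keep them alongside the R version so a later run can be compared
# against the environment the code was originally written in
environment_record <- list(
  r_version = R.version.string,
  system_libraries = system_versions
)
str(environment_record)
```

A lockfile does not capture any of this, which is part of why the system environment sometimes needs to be controlled separately.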
As we have a chapter dedicated to Docker (which I haven’t read), I won’t dive into this here. I expect we’ll have run out of time too.
What did we learn?
Learning Objective | What Did We Learn? |
---|---|
List the three layers of data science environments. | |
Explain why it’s important to control the package environment. | |
Use {renv} (or {venv}) to control the package environment. | |
Recognize when it is important to control the system environment. | |