⚠️ Notice: I’m referencing Python 3 throughout this post.
At the beginning of your coding journey, you start with simple projects to solidify your understanding. You’re probably downloading packages and exploring different functionalities at this stage but aren’t concerned with dependencies.
Fortunately, we live in an era where reusable code is plentiful and easy to access. We don’t need to worry about reinventing the wheel. We can focus on our projects, or add to and improve existing libraries. But at some point, dependency management will need to be considered.
Managing dependencies is something you can’t get away from (unless you write all the code yourself and never call an external library); it’s almost as certain as death and taxes. Since this blog is geared towards data practitioners and those who want to get into data, I’ll focus on an essential tool for developing code in Python: the virtual environment.
What is a Virtual Environment?
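A virtual environment is a self-contained folder on your computer with its own Python interpreter and its own set of installed packages. Because each environment is isolated, different projects can use different versions of the same package without interfering with one another.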
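As a quick illustration of that isolation, here is a minimal sketch for checking whether the interpreter you’re running belongs to a virtual environment. The helper name in_virtual_environment is my own, and the check assumes an environment created with venv (or a recent virtualenv); conda environments are set up differently and may not be detected this way.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtual_environment())
    print("sys.prefix      =", sys.prefix)
    print("sys.base_prefix =", sys.base_prefix)
```

Running it once with an environment activated and once without is an easy way to see the two prefixes diverge.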
Pandas Profiling
Here’s a concrete example. A couple of months ago, I was doing some analysis and wanted to use a Python package I hadn’t used before called Pandas Profiling. I tried to use this package to automate some initial data quality checks.
The version I installed at the time was 3.1.0, and one of its dependencies (jinja2 version 3.1.0) gave me grief.
It turned out that there’s a module in Pandas Profiling named templates.py that uses jinja2 to create HTML profile reports.
from pandas_profiling import ProfileReport
So when I tried running the line of code above, I ran into this error: ImportError: cannot import name 'escape' from 'jinja2.utils'. Unfortunately, the escape function was removed from jinja2 when version 3.1.0 was released.
There were a couple of ways to resolve the issue, and I decided to downgrade jinja2 to version 3.0.3, the latest version that still included escape.
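If you’d like to check which side of that version boundary an environment is on before importing anything, something along these lines could work. It’s only a sketch: parse_version, installed_version, and still_has_escape are hypothetical helpers I made up for illustration, and the 3.1 cut-off simply encodes the account above (escape present up to 3.0.3, gone in 3.1.0).

```python
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '3.0.3' into a tuple like (3, 0, 3)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def installed_version(package):
    """Return the installed version of a package, or None if it isn't installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def still_has_escape(jinja2_version):
    """Assumption from this post: escape was importable before jinja2 3.1.0."""
    return parse_version(jinja2_version) < (3, 1)


if __name__ == "__main__":
    found = installed_version("jinja2")
    if found is None:
        print("jinja2 is not installed in this environment")
    else:
        print(f"jinja2 {found}: escape still importable = {still_has_escape(found)}")
```

The check reads package metadata only, so it never triggers the failing import itself.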
Fortunately, I had created a virtual environment before starting my project. Imagine if I had worked on another project in early 2022 that relied on jinja2 version 3.1.0 and ran without errors. I’d have carried on without a care until I was faced with the pandas_profiling fiasco.
What would I have done in that situation? Downgrading jinja2 could have adverse effects on the earlier project. Still, I needed to downgrade to solve the dependency issue.
This is why virtual environments are essential. They allow you to separate Python packages and install the versions you need. Without them, we would be one step closer to falling into dependency hell. Our computer would be a nightmare of conflicting packages! We’d rather have the nightmare localized in a smaller folder.
Visualize your libraries and their dependencies as wires. Would you want to deal with a tangled mess of conflicting packages? Where would you even begin to fix the problem?
Virtual environments allow us to install and import libraries without modifying our computer’s main site-packages directory. So if we mess up the dependencies, at least it’s at a smaller scale.
Reproducibility
Using virtual environments also allows us to have a reproducible workspace. Data practitioners like Data Scientists and ML Engineers will need to be able to share their work/findings with their peers and stakeholders.
Having the mindset of "but it worked on my machine" is 💩. Don’t be that person! You weren’t hired to have models and algorithms that only work on your computer.
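One way to make a workspace shareable is to record exactly which package versions are installed in it. The usual tool for this is pip freeze; the sketch below is a rough, simplified stand-in written with the standard library so you can see what such a snapshot contains. The function name snapshot_environment is my own invention.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for the packages visible to this interpreter."""
    lines = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            # Skip distributions whose metadata is missing or unreadable.
            continue
        if name and version:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(snapshot_environment()))
```

Run inside an activated environment, the output lists only what that environment can see, which is what a colleague would need to install to match your setup.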
How do Virtual Environments Work?
When you create a virtual environment, Python builds a new folder in the current working directory that mirrors the layout of a Python installation. Inside that folder is a sub-directory called bin (Scripts on Windows), and within bin Python places a symbolic link to, or on some platforms a copy of, the Python interpreter the environment was created from.
In the example below, I named my virtual environment env.
Inside bin is the activate script; this will activate your virtual environment. There may be several activate scripts, but you’ll need to run the version compatible with your shell. For example, I use a bash shell, so activate is what I’ll use. If I were using PowerShell, then Activate.ps1 would be used.
The lib folder holds your environment’s Python version folder, packages, and modules. Inside it is the site-packages folder; this is where our installed packages and modules go.
Whenever we use an import statement in our Python scripts, the environment’s interpreter searches its own site-packages directory, which it locates relative to the environment folder rather than the system-wide installation.
In this case ./lib/python3.8/site-packages will be used.
|____env
| |____bin
| | |____activate
| | |____pyproj
| | |____pyftsubset
| | |____pip3.8
| | |____jupyter-run
| |____lib
| | |____python3.8
| | | |____site-packages
Note: a symbolic link (also called a symlink) is a shortcut to a file or directory. It’s a great way to simplify access since you don’t need to worry about typing absurdly long paths!
Note: PATH is an environment variable that lists the directories your shell searches for executable commands. Activating a virtual environment puts its bin folder at the front of PATH, so python and pip resolve to the environment’s copies. Python itself then looks for modules and packages in the locations listed in sys.path, which include the environment’s site-packages.
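To see these pieces on your own machine, a small sketch like the following prints which interpreter is running, which one your shell would find on PATH, and where imports are searched. The function name describe_import_locations is hypothetical; the values it reports depend entirely on your setup.

```python
import shutil
import sys
import sysconfig


def describe_import_locations():
    """Collect the paths that decide which interpreter and packages get used."""
    return {
        # The interpreter actually running this script.
        "interpreter": sys.executable,
        # What the shell would find first on PATH (None if nothing matches).
        "python_on_path": shutil.which("python3") or shutil.which("python"),
        # Where pure-Python packages are installed for this interpreter.
        "site_packages": sysconfig.get_paths()["purelib"],
        # The ordered list of folders searched by import statements.
        "search_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_import_locations().items():
        print(f"{key}: {value}")
```

With an environment activated, the interpreter and site_packages entries should both point inside the environment folder.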
How do We Create Virtual Environments?
There are three methods I can think of off the top of my head.
- Anaconda
- venv
- virtualenv
venv is the one I use most often, and it’s part of Python’s standard library; you don’t need to install it since you got it when you first installed Python on your computer.
Assuming you’re in the directory that’ll store your Python project, you’d call: python3 -m venv env to create a virtual environment named env.
To activate env in bash, call: source env/bin/activate. On PowerShell it’s: .\env\Scripts\Activate.ps1. To deactivate the current session, we call: deactivate in our terminal.
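The same venv module can also be driven from Python, which is handy for peeking at what gets created. Below is a minimal sketch that builds a throwaway environment in a temporary folder; create_environment is a name I chose for illustration, and I pass with_pip=False to keep it quick, whereas the command-line call installs pip by default.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(target: Path) -> Path:
    """Create a virtual environment at target and return the path to its pyvenv.cfg."""
    # with_pip=False skips bootstrapping pip so the sketch stays fast.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        env_dir = Path(scratch) / "env"
        config = create_environment(env_dir)
        # pyvenv.cfg records which Python installation the environment came from.
        print(config.read_text())
        print(sorted(entry.name for entry in env_dir.iterdir()))
```

For everyday work the terminal command is simpler; this version is mainly useful for exploring the folder layout without cluttering a project.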
virtualenv is a third-party package that predates venv and offers some extras venv lacks, such as faster environment creation and the ability to target other Python versions installed on your machine.
You’d call: virtualenv -p python3.9 env to create a virtual environment named env that will use Python version 3.9 and source env/bin/activate to activate it (via bash). Similar to venv, we call deactivate to deactivate the current session.
You can read more about virtualenv in its official documentation. However, if you’re relatively new to programming, I’d suggest sticking with venv or Anaconda.
For venv and virtualenv, you could use a package manager like pip to install packages.
Anaconda is a commercial distribution of Python (there is a free tier for students and hobbyists). You’re more likely to encounter it in a university setting or in industry since it is adored by data practitioners and companies worldwide. It uses conda for handling environments and dependencies. You can think of conda as a fusion between pip and venv.
We can call: conda create --name myenv python=3.x to create a virtual environment called myenv using a Python version of our choice. We then call conda activate myenv to activate myenv and conda deactivate to deactivate the current session. You can learn more about Anaconda in its official documentation.
Pip or Conda?
Generally, I’d recommend sticking to one package manager when installing third-party libraries. But sometimes rules are broken, and that’s okay.
Virtual environments are crucial for having clean, reproducible Python workspaces. Fortunately, the learning curve isn’t steep, and the more you practice using them, the more habitual it’ll become.
If this is your first time working with virtual environments, I recommend sticking with venv. If you want to get into data science/machine learning, start with venv and then work your way to Anaconda.
Happy coding!