Part 2 of the “Docker for Data Professionals” Series

Welcome back! If you completed the homework from Part 1, you’ve already run your first containers. Pretty cool, right? You pulled a Jupyter environment with one command. No conda, no pip, just instant setup.
But here’s where it gets really powerful: building your own custom images. This is where Docker transforms from “neat tool” to “how did I ever work without this?”
Today we’re going hands-on. By the end of this post, you’ll understand the essential Docker commands, build your first custom image, and containerize a real data science workflow. Let’s dive in!
Your Go-To Docker Commands
Before we build anything, let’s master the commands you’ll use daily. Think of these as your Docker vocabulary: learn them and you can navigate any container workflow.
Working with Images
# Pull an image from a registry
docker pull python:3.9.18
# List all images on your system
docker images
# Remove an image (free up space!)
docker rmi python:3.9.18
# Remove all unused images
docker image prune -a
Tip: Run docker image prune -a weekly to clean up unused images. Your SSD will thank you!
Managing Containers
# Run a container (creates and starts it)
docker run python:3.9.18
# List running containers
docker ps
# List ALL containers (including stopped ones)
docker ps -a
# Stop a running container
docker stop <container_id>
# Start a stopped container
docker start <container_id>
# Remove a container
docker rm <container_id>
# Remove all stopped containers
docker container prune
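To see how these commands fit together, here’s a minimal run, inspect, and clean-up sequence you can try right now (py-demo is just an illustrative container name):
# Run a short-lived container with a name so it's easy to refer to later
docker run --name py-demo python:3.9.18 python -c "print('hello from a container')"
docker ps -a          # py-demo shows up as Exited
docker logs py-demo   # prints the message again
docker rm py-demo     # clean up
# Or let Docker remove the container automatically when it exits
docker run --rm python:3.9.18 python -c "print('and gone again')"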
Debugging & Inspection
# View container logs (super useful for debugging!)
docker logs <container_id>
# Execute a command inside a running container
docker exec -it <container_id> bash
# View detailed container info
docker inspect <container_id>
Real-world scenario: My training script crashed inside a container. Instead of rebuilding everything, I ran docker logs <container_id> to see the error, then docker exec -it <container_id> bash to poke around and debug. It saved me a lot of time.
Understanding Docker Tags
Remember in Part 1 when I mentioned tags are crucial for reproducibility? Let’s dig deeper. This section could save you hours of debugging.
Docker tags are version labels: essential version control for your Docker images. They follow the format image_name:tag. For example:
- python:3.9.18 – a specific Python version
- tensorflow/tensorflow:2.13.0-gpu – a specific TensorFlow build with GPU support
- my-model:latest – dangerous (we’ll get to why!)
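Not sure what a tag will actually give you? One quick sanity check is to run the image and ask Python itself; the second command’s output depends on wherever the floating tag points on the day you pull it:
docker run --rm python:3.9.18 python --version   # always reports Python 3.9.18
docker run --rm python:3.9 python --version      # whatever 3.9.x the tag points to today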
The “Latest” Tag Trap
I’m posting this during spooky season, so here’s a ghost story about latest tags:
I was working with a team deploying a sentiment analysis model. In development, we used FROM python:3.9 in our Dockerfile. Everything worked perfectly. We tested, validated, and deployed to production.
A few months later, without changing any code, production started failing. Customer complaints rolled in, and predictions were suddenly different. Everything was going downhill fast, and it took far too long to track down such a tiny bug.
What happened? The python:3.9 tag (without a specific patch version) had been updated from Python 3.9.16 to 3.9.18. A subtle change in how Python 3.9.18 handled floating-point operations caused different random seeds, which cascaded through our entire preprocessing pipeline.
The fix? Change one line: FROM python:3.9 → FROM python:3.9.16
Scary, huh? Who would’ve thought that something as small as a patch version could cause so much trouble?
If you’re wondering what a patch version even is, let’s unpack that.
Semantic Versioning 101
Let’s decode what those version numbers actually mean. You’ve seen tags like python:3.9.18 or tensorflow:2.13.0, but what do those numbers represent?
This is semantic versioning (semver), and it follows a simple pattern: MAJOR.MINOR.PATCH
tensorflow:2.13.0
           │ │  │
           │ │  └── PATCH (bug fixes only)
           │ └───── MINOR (new features, backward-compatible)
           └─────── MAJOR (breaking changes)
Here’s what each number means:
- MAJOR (2): Breaking changes. Code that worked in v1.x might break in v2.x
- MINOR (13): New features added, but backward-compatible. v2.12 code works in v2.13
- PATCH (0): Bug fixes only. No new features, no breaking changes.
Why this matters for Docker:
When you use python:3.9, you’re saying “give me the latest version of Python 3.9.x available at build time”. On a future rebuild, that tag may point to a newer patch release, so you can get different behavior without changing your code.
This is exactly what I experienced: we assumed the pipeline was running on python:3.9.16, but when our Docker image was rebuilt, the tag had moved on and we silently got python:3.9.18.
When you use python:3.9.16, you’re locked to that exact version. Reproducibility guaranteed!
Semantic versioning isn’t just for Docker. You’ll see it everywhere in data work:
- Python packages: pandas==2.0.3
- APIs: v2.1.4
- ML models: sentiment-classifier:1.2.0
- Your own pipelines!
Tag Best Practices
Now that you understand version numbers, here are the rules:
✅ DO: Use specific version tags (all three numbers!)
FROM python:3.9.18
FROM tensorflow/tensorflow:2.13.0-gpu
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
❌ DON’T: Use vague or latest tags in production
FROM python:3.9 # Which 3.9.x? Could change!
FROM tensorflow:latest # Latest today ≠ latest tomorrow
FROM ubuntu # Defaults to latest (bad!)
When to use latest? Use it during quick local experiments where reproducibility doesn’t matter.
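If you ever need to be even stricter than a full three-part tag, Docker also records a content digest for every image you pull. That’s optional for most data work, but it’s a handy way to confirm exactly which image bytes you’re running:
# Show the sha256 digest each locally pulled python tag points to
docker images --digests python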
Anatomy of a Dockerfile
A Dockerfile is just a text file with instructions. Let’s break down the key commands you’ll use constantly.
Core Dockerfile Instructions
# Start from a base image
FROM python:3.9.18
# Set the working directory inside the container
WORKDIR /app
# Copy files from your machine to the container
COPY requirements.txt .
COPY train.py .
COPY data/ ./data/
# Run commands (installing packages, etc.)
RUN pip install --no-cache-dir -r requirements.txt
# Set environment variables
ENV MODEL_PATH=/app/models
# Expose a port (for APIs, Jupyter, etc.)
EXPOSE 8000
# Default command to run when container starts
CMD ["python", "train.py"]
Let’s break it down:

- FROM: Every Dockerfile starts with a base image. It’s your foundation.
- WORKDIR: Sets the default working directory for subsequent instructions. It’s like running cd /app, except it sticks. Docker is like an onion: it has layers, and a plain cd inside one RUN only affects that single layer. It does not carry over to the next instruction.
- COPY: Copies files from your computer into the image.
- RUN: Executes commands during the image build (installing packages, creating directories, etc.). Each RUN command executes in a fresh shell and produces a new image layer (think of each RUN as another layer of the Docker onion). Filesystem changes persist across steps; transient shell state does not (see the quick demo below).
- ENV: Sets environment variables available at runtime.
- EXPOSE: Documents which port(s) your container listens on at runtime.
- CMD: The default command when someone (or something) runs your container.
Tip: Put the things that rarely change (like FROM and installing packages) at the top, and the things that change often (like copying your code) at the bottom. This makes rebuilds lightning fast!
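That “each RUN gets a fresh shell” point trips a lot of people up, so here’s a rough analogy you can try in your terminal. Two separate docker run commands behave much like two separate RUN lines: each starts a fresh shell, so the cd from the first never reaches the second.
docker run --rm python:3.9.18-slim bash -c "cd /tmp && pwd"   # prints /tmp
docker run --rm python:3.9.18-slim bash -c "pwd"              # prints / -- the cd didn't carry over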
Building Your First Custom Image
Time to get your hands dirty! We’re going to containerize a simple data analysis pipeline. Something you’d actually use in real work.
Create a folder structure like this:
my-first-docker/
├── Dockerfile
├── requirements.txt
├── analyze.py
└── data/
    └── sample_data.csv
Step 1: Create Your Files
First, let’s create the data file. This is a simple dataset tracking ML model performance metrics (the kind of thing you’d actually analyze in real life).
data/sample_data.csv
model_accuracy,training_time,data_size
0.85,120,1000
0.87,145,1500
0.89,180,2000
0.91,210,2500
0.88,165,1800
0.92,225,3000
0.86,135,1200
0.90,195,2200
0.93,240,3500
0.89,170,1900
0.87,150,1600
0.91,205,2400
0.88,160,1700
0.90,190,2100
0.85,125,1100
Now create your analysis script.
analyze.py
import pandas as pd
import matplotlib.pyplot as plt
import os

def analyze_data():
    # Load data
    df = pd.read_csv("data/sample_data.csv")

    # Simple analysis
    print("Dataset shape:", df.shape)
    print("\nSummary statistics:")
    print(df.describe())

    # Create outputs directory if it doesn't exist
    os.makedirs("outputs", exist_ok=True)

    # Create a plot
    df.plot(kind="hist")
    plt.savefig("outputs/output.png")
    print("\nPlot saved as outputs/output.png")

if __name__ == "__main__":
    analyze_data()
requirements.txt
numpy==1.24.3
pandas==2.0.3
matplotlib==3.7.2
Step 2: Write Your Dockerfile
Dockerfile
# Use specific Python version
FROM python:3.9.18-slim
# Set working directory
WORKDIR /app
# Copy requirements first (caching optimization!)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the project
COPY analyze.py .
COPY data/ ./data/
# Run the analysis
CMD ["python", "analyze.py"]
Why -slim? The python:3.9.18-slim image is a fraction of the size of the full python:3.9.18 image. For data work, you rarely need the extra system packages the full version ships with.
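If you’re curious exactly how much -slim saves on your machine, you can pull both and compare. Note that this downloads both images, so it will use up roughly a gigabyte of disk space:
docker pull python:3.9.18
docker pull python:3.9.18-slim
docker images python   # compare the SIZE column for the two tags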
Step 3: Build Your Image
# Build the image and tag it
docker build -t my-analysis:v1.0 .
# The '.' means "look for Dockerfile in current directory"
# '-t' tags your image with a name
You’ll see Docker executing each instruction. The first build takes a minute or two. Watch the magic happen!
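To see layer caching in action, run the exact same build a second time. Nothing has changed, so every step is reused and the build finishes almost instantly; depending on your Docker version, the output marks these steps as CACHED or “Using cache”.
# Second build, no changes: every layer comes straight from the cache
docker build -t my-analysis:v1.0 .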
Step 4: Run Your Container
# Run your analysis
docker run my-analysis:v1.0
# To save the output plot to your local machine:
docker run -v $(pwd)/outputs:/app/outputs my-analysis:v1.0
What just happened? You packaged your entire analysis environment (Python, pandas, matplotlib, your code) into a single, portable image. Anyone with Docker can run this analysis identically!
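One platform note: $(pwd) is shell syntax for macOS and Linux. If you’re on Windows, these are the usual equivalents (adjust the path style if your setup differs):
# PowerShell
docker run -v ${PWD}/outputs:/app/outputs my-analysis:v1.0
# Command Prompt (cmd.exe)
docker run -v %cd%/outputs:/app/outputs my-analysis:v1.0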
Create a .dockerignore File
# Python artifacts
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
# Jupyter notebooks
*.ipynb
.ipynb_checkpoints/
# Virtual environments
venv/
env/
# Data folders (if large)
data/raw/
# Git
.git/
.gitignore
# IDE
.vscode/
.idea/
I once accidentally included a 2GB dataset in a Docker image because I forgot .dockerignore. The image took 30 minutes to push to our registry. Don’t be like past-me. Use .dockerignore!
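A cheap way to catch this kind of mistake before it costs you 30 minutes: check how big your project folder is before building, since everything not excluded by .dockerignore gets sent to Docker as the build context.
# Rough upper bound on your build context size
# (du doesn't know about .dockerignore, so the real context may be smaller)
du -sh .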
Testing Your Container
# 1. Check if it runs
docker run my-analysis:v1.0
# 2. Inspect what's inside
docker run -it my-analysis:v1.0 bash
ls -la
cat requirements.txt
exit
# 3. Check the image size
docker images my-analysis:v1.0
# 4. Verify logs
docker run my-analysis:v1.0
docker ps -a # Get container ID
docker logs <container_id>
Note: When you run docker run my-analysis:v1.0, the container completes immediately and stops, which is why docker ps -a shows stopped containers.
Tip: Use docker run -it <your-image> bash to get an interactive shell, then manually run commands to debug. It’s like SSH-ing into your container.
Congratulations! You just:
- Learned essential Docker commands
- Understood the importance of specific version tags
- Wrote your first Dockerfile
- Built and ran a custom image
- Containerized a real data analysis workflow
Homework

Exercise 1: Experiment
Change something in analyze.py, rebuild with a new tag (v1.1), and run it. Notice how fast the rebuild is? That’s layer caching!
Exercise 2: Challenge
Try running docker run -it python:3.9.18-slim bash and play around. Install packages, create files, then exit. Run it again… notice everything resets? Docker containers are ephemeral (that’s nerd talk for temporary).
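If you want a condensed, non-interactive version of that reset in two commands (the file name here is just for illustration):
docker run --rm python:3.9.18-slim bash -c "touch /tmp/i-was-here && ls /tmp"
docker run --rm python:3.9.18-slim bash -c "ls /tmp"   # the file is gone: each run starts a brand-new container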
Exercise 3: Real Project
Pick a simple Python script you’ve written. Containerize it following today’s pattern.
In Part 3 we’ll tackle the biggest challenge for data professionals: data persistence! How do you work with large datasets, save models, and share data between containers? We’ll dive into volumes, bind mounts, and building multi-container environments with Docker Compose.
Want the extensive command reference? Grab the Docker for Data Professionals Cheat Sheet to keep these commands handy!
Remember: every expert was once a beginner. Hopefully, after working through this, you’ve realized that it’s actually pretty straightforward. See you in Part 3, and happy coding!