Part 2 of the “Docker for Data Professionals” Series

Welcome back! If you completed the homework from Part 1, you’ve already run your first containers. Pretty cool, right? You pulled a Jupyter environment with one command. No conda, no pip, just instant setup.
But here’s where it gets really powerful: building your own custom images. This is where Docker transforms from “neat tool” to “how did I ever work without this?”
Today we’re going hands-on. By the end of this post, you’ll understand the essential Docker commands, build your first custom image, and containerize a real data science workflow. Let’s dive in!
Your Go-To Docker Commands
Before we build anything, let’s master the commands you’ll use daily. Think of these as your Docker vocabulary: learn them and you can navigate any container workflow.
Working with Images
# Pull an image from a registry
docker pull python:3.9.18
# List all images on your system
docker images
# Remove an image (free up space!)
docker rmi python:3.9.18
# Remove all unused images
docker image prune -a
Tip: Run docker image prune -a weekly to clean up unused images. Your SSD will thank you!
Managing Containers
# Run a container (creates and starts it)
docker run python:3.9.18
# List running containers
docker ps
# List ALL containers (including stopped ones)
docker ps -a
# Stop a running container
docker stop <container_id>
# Start a stopped container
docker start <container_id>
# Remove a container
docker rm <container_id>
# Remove all stopped containers
docker container prune
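To see how these commands fit together, here’s a minimal run, inspect, and clean-up sequence you can try right now (py-demo is just an illustrative container name):
# Run a short-lived container with a name so it's easy to refer to later
docker run --name py-demo python:3.9.18 python -c "print('hello from a container')"
docker ps -a          # py-demo shows up as Exited
docker logs py-demo   # prints the message again
docker rm py-demo     # clean up
# Or let Docker remove the container automatically when it exits
docker run --rm python:3.9.18 python -c "print('and gone again')"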
Debugging & Inspection
# View container logs (super useful for debugging!)
docker logs <container_id>
# Execute a command inside a running container
docker exec -it <container_id> bash
# View detailed container info
docker inspect <container_id>
Real-world scenario: My training script crashed inside a container. Instead of rebuilding everything, I ran docker logs <container_id> to see the error, then docker exec -it <container_id> bash to poke around and debug. It saved me a lot of time.
Understanding Docker Tags
Remember in Part 1 when I mentioned tags are crucial for reproducibility? Let’s dig deeper. This section could save you hours of debugging.
Docker tags are version labels: essential version control for your Docker images. They follow the format image_name:tag. For example:
- python:3.9.18 – a specific Python version
- tensorflow/tensorflow:2.13.0-gpu – a specific TensorFlow build with GPU support
- my-model:latest – dangerous (we’ll get to why!)
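Not sure what a tag will actually give you? One quick sanity check is to run the image and ask Python itself; the second command’s output depends on wherever the floating tag points on the day you pull it:
docker run --rm python:3.9.18 python --version   # always reports Python 3.9.18
docker run --rm python:3.9 python --version      # whatever 3.9.x the tag points to today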
The “Latest” Tag Trap
I’m posting this during spooky season, so here’s a ghost story about latest tags:
I was working with a team deploying a sentiment analysis model. In development, we used FROM python:3.9 in our Dockerfile. Everything worked perfectly. We tested, validated, and deployed to production.
A few months later, without changing any code, production started failing. Customer complaints rolled in, and predictions were suddenly different. Everything was going downhill fast, and it took far too long to track down such a tiny bug.
What happened? The python:3.9 tag (without a specific patch version) had been updated from Python 3.9.16 to 3.9.18. A subtle change in how Python 3.9.18 handled floating-point operations caused different random seeds, which cascaded through our entire preprocessing pipeline.
The fix? Change one line: FROM python:3.9 → FROM python:3.9.16
Scary, huh? Who would’ve thought that something as small as a patch version could cause so much trouble?
If you’re wondering what a patch version even is, let’s unpack that.
Semantic Versioning 101
Let’s decode what those version numbers actually mean. You’ve seen tags like python:3.9.18 or tensorflow:2.13.0, but what do those numbers represent?
This is semantic versioning (semver), and it follows a simple pattern: MAJOR.MINOR.PATCH
tensorflow:2.13.0
           │ │  │
           │ │  └── PATCH (bug fixes only)
           │ └───── MINOR (new features, backward-compatible)
           └─────── MAJOR (breaking changes)
Here’s what each number means:
- MAJOR (2): Breaking changes. Code that worked in v1.x might break in v2.x
- MINOR (13): New features added, but backward-compatible. v2.12 code works in v2.13
- PATCH (0): Bug fixes only. No new features, no breaking changes.
Why this matters for Docker:
When you use python:3.9, you’re saying “give me the latest version of Python 3.9.x available at build time”. On a future rebuild, that tag may point to a newer patch release, so you can get different behavior without changing your code.
This is exactly what I experienced: we assumed the pipeline was running on python:3.9.16, but when our Docker image was rebuilt, the tag had moved on and we silently got python:3.9.18.
When you use python:3.9.16, you’re locked to that exact version. Reproducibility guaranteed!
Semantic versioning isn’t just for Docker. You’ll see it everywhere in data work:
- Python packages: pandas==2.0.3
- APIs: v2.1.4
- ML models: sentiment-classifier:1.2.0
- Your own pipelines!
Tag Best Practices
Now that you understand version numbers, here are the rules:
✅ DO: Use specific version tags (all three numbers!)
FROM python:3.9.18
FROM tensorflow/tensorflow:2.13.0-gpu
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
❌ DON’T: Use vague or latest tags in production
FROM python:3.9 # Which 3.9.x? Could change!
FROM tensorflow:latest # Latest today ≠ latest tomorrow
FROM ubuntu # Defaults to latest (bad!)
When to use latest? Use it during quick local experiments where reproducibility doesn’t matter.
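If you ever need to be even stricter than a full three-part tag, Docker also records a content digest for every image you pull. That’s optional for most data work, but it’s a handy way to confirm exactly which image bytes you’re running:
# Show the sha256 digest each locally pulled python tag points to
docker images --digests python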
Anatomy of a Dockerfile
A Dockerfile is just a text file with instructions. Let’s break down the key commands you’ll use constantly.
Core Dockerfile Instructions
# Start from a base image
FROM python:3.9.18
# Set the working directory inside the container
WORKDIR /app
# Copy files from your machine to the container
COPY requirements.txt .
COPY train.py .
COPY data/ ./data/
# Run commands (installing packages, etc.)
RUN pip install --no-cache-dir -r requirements.txt
# Set environment variables
ENV MODEL_PATH=/app/models
# Expose a port (for APIs, Jupyter, etc.)
EXPOSE 8000
# Default command to run when container starts
CMD ["python", "train.py"]
Let’s break it down:

- FROM: Every Dockerfile starts with a base image. It’s your foundation.
- WORKDIR: Sets the default working directory for subsequent instructions. It’s like running cd /app, except it sticks. Docker is like an onion: it has layers, and a plain cd inside one RUN only affects that single layer. It does not carry over to the next instruction.
- COPY: Copies files from your computer into the image.
- RUN: Executes commands during the image build (installing packages, creating directories, etc.). Each RUN command executes in a fresh shell and produces a new image layer (think of each RUN as another layer of the Docker onion). Filesystem changes persist across steps; transient shell state does not (see the quick demo below).
- ENV: Sets environment variables available at runtime.
- EXPOSE: Documents which port(s) your container listens on at runtime.
- CMD: The default command when someone (or something) runs your container.
Tip: Put the things that rarely change (like FROM and installing packages) at the top, and the things that change often (like copying your code) at the bottom. This makes rebuilds lightning fast!
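That “each RUN gets a fresh shell” point trips a lot of people up, so here’s a rough analogy you can try in your terminal. Two separate docker run commands behave much like two separate RUN lines: each starts a fresh shell, so the cd from the first never reaches the second.
docker run --rm python:3.9.18-slim bash -c "cd /tmp && pwd"   # prints /tmp
docker run --rm python:3.9.18-slim bash -c "pwd"              # prints / -- the cd didn't carry over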
Building Your First Custom Image
Time to get your hands dirty! We’re going to containerize a simple data analysis pipeline. Something you’d actually use in real work.
Create a folder structure like this:
my-first-docker/
├── Dockerfile
├── requirements.txt
├── analyze.py
└── data/
    └── sample_data.csv
Step 1: Create Your Files
First, let’s create the data file. This is a simple dataset tracking ML model performance metrics (the kind of thing you’d actually analyze in real life).
data/sample_data.csv
model_accuracy,training_time,data_size
0.85,120,1000
0.87,145,1500
0.89,180,2000
0.91,210,2500
0.88,165,1800
0.92,225,3000
0.86,135,1200
0.90,195,2200
0.93,240,3500
0.89,170,1900
0.87,150,1600
0.91,205,2400
0.88,160,1700
0.90,190,2100
0.85,125,1100
Now create your analysis script.
analyze.py
import pandas as pd
import matplotlib.pyplot as plt
import os

def analyze_data():
    # Load data
    df = pd.read_csv("data/sample_data.csv")

    # Simple analysis
    print("Dataset shape:", df.shape)
    print("\nSummary statistics:")
    print(df.describe())

    # Create outputs directory if it doesn't exist
    os.makedirs("outputs", exist_ok=True)

    # Create a plot
    df.plot(kind="hist")
    plt.savefig("outputs/output.png")
    print("\nPlot saved as outputs/output.png")

if __name__ == "__main__":
    analyze_data()
requirements.txt
numpy==1.24.3
pandas==2.0.3
matplotlib==3.7.2
Step 2: Write Your Dockerfile
Dockerfile
# Use specific Python version
FROM python:3.9.18-slim
# Set working directory
WORKDIR /app
# Copy requirements first (caching optimization!)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the project
COPY analyze.py .
COPY data/ ./data/
# Run the analysis
CMD ["python", "analyze.py"]
Why -slim? The python:3.9.18-slim image is a fraction of the size of the full python:3.9.18 image. For data work, you rarely need the extra system packages the full version ships with.
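If you’re curious exactly how much -slim saves on your machine, you can pull both and compare. Note that this downloads both images, so it will use up roughly a gigabyte of disk space:
docker pull python:3.9.18
docker pull python:3.9.18-slim
docker images python   # compare the SIZE column for the two tags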
Step 3: Build Your Image
# Build the image and tag it
docker build -t my-analysis:v1.0 .
# The '.' means "look for Dockerfile in current directory"
# '-t' tags your image with a name
You’ll see Docker executing each instruction. The first build takes a minute or two. Watch the magic happen!
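To see layer caching in action, run the exact same build a second time. Nothing has changed, so every step is reused and the build finishes almost instantly; depending on your Docker version, the output marks these steps as CACHED or “Using cache”.
# Second build, no changes: every layer comes straight from the cache
docker build -t my-analysis:v1.0 .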
Step 4: Run Your Container
# Run your analysis
docker run my-analysis:v1.0
# To save the output plot to your local machine:
docker run -v $(pwd)/outputs:/app/outputs my-analysis:v1.0
What just happened? You packaged your entire analysis environment (Python, pandas, matplotlib, your code) into a single, portable image. Anyone with Docker can run this analysis identically!
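One platform note: $(pwd) is shell syntax for macOS and Linux. If you’re on Windows, these are the usual equivalents (adjust the path style if your setup differs):
# PowerShell
docker run -v ${PWD}/outputs:/app/outputs my-analysis:v1.0
# Command Prompt (cmd.exe)
docker run -v %cd%/outputs:/app/outputs my-analysis:v1.0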
Create a .dockerignore File
# Python artifacts
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
# Jupyter notebooks
*.ipynb
.ipynb_checkpoints/
# Virtual environments
venv/
env/
# Data folders (if large)
data/raw/
# Git
.git/
.gitignore
# IDE
.vscode/
.idea/
I once accidentally included a 2GB dataset in a Docker image because I forgot .dockerignore. The image took 30 minutes to push to our registry. Don’t be like past-me. Use .dockerignore!
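A cheap way to catch this kind of mistake before it costs you 30 minutes: check how big your project folder is before building, since everything not excluded by .dockerignore gets sent to Docker as the build context.
# Rough upper bound on your build context size
# (du doesn't know about .dockerignore, so the real context may be smaller)
du -sh .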
Testing Your Container
# 1. Check if it runs
docker run my-analysis:v1.0
# 2. Inspect what's inside
docker run -it my-analysis:v1.0 bash
ls -la
cat requirements.txt
exit
# 3. Check the image size
docker images my-analysis:v1.0
# 4. Verify logs
docker run my-analysis:v1.0
docker ps -a # Get container ID
docker logs <container_id>
Note: When you run docker run my-analysis:v1.0, the container completes immediately and stops, which is why docker ps -a shows stopped containers.
Tip: Use docker run -it <your-image> bash to get an interactive shell, then manually run commands to debug. It’s like SSH-ing into your container.
Congratulations! You just:
- Learned essential Docker commands
- Understood the importance of specific version tags
- Wrote your first Dockerfile
- Built and ran a custom image
- Containerized a real data analysis workflow
Homework

Exercise 1: Experiment
Change something in analyze.py, rebuild with a new tag (v1.1), and run it. Notice how fast the rebuild is? That’s layer caching!
Exercise 2: Challenge
Try running docker run -it python:3.9.18-slim bash and play around. Install packages, create files, then exit. Run it again… notice everything resets? Docker containers are ephemeral (that’s nerd talk for temporary).
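If you want a condensed, non-interactive version of that reset in two commands (the file name here is just for illustration):
docker run --rm python:3.9.18-slim bash -c "touch /tmp/i-was-here && ls /tmp"
docker run --rm python:3.9.18-slim bash -c "ls /tmp"   # the file is gone: each run starts a brand-new container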
Exercise 3: Real Project
Pick a simple Python script you’ve written. Containerize it following today’s pattern.
In Part 3 we’ll tackle the biggest challenge for data professionals: data persistence! How do you work with large datasets, save models, and share data between containers? We’ll dive into volumes, bind mounts, and building multi-container environments with Docker Compose.
Want the extensive command reference? Grab the Docker for Data Professionals Cheat Sheet to keep these commands handy!
Remember: every expert was once a beginner. Hopefully, after working through this, you’ve realized that it’s actually pretty straightforward. See you in Part 3, and happy coding!