Part 4 of the “Docker for Data Professionals” Series

Welcome to Part 4! You’ve learned Docker fundamentals, built custom images, and orchestrated multi-container environments with Compose. Now it’s time to level up your engineering skills with the patterns that separate “it works on my machine” from “it works in production”.
Here’s the reality: your PM doesn’t care that your sentiment analysis model gets 94% accuracy in your Jupyter notebook. They care whether it can handle 1,000 requests per minute without falling over. Today, we’re building things that can ship to production.
You’ll containerize a complete ML model serving API, configure GPU support for training jobs, and learn to debug the issues that will inevitably come up. These are the patterns you’ll see in production codebases. Let’s get into it.
Containerizing an ML Model API
Here’s the scenario: you’ve trained a sentiment analysis model in a notebook. It works great. Now your PM wants it deployed as an API so the product team can integrate it. You need to go from research code to production service.
At most companies, a machine learning engineer handles this transition from research to production. But if you’re a data scientist looking to expand your skill set, or you’re at a smaller company where you wear multiple hats, these are the skills that set you apart.
What is the key difference between research code and production code? Your API needs to load the model once at startup (not on every request, because that would be too slow) and serve predictions via HTTP endpoints.
The Setup
We’re building a FastAPI service that handles real traffic. Here’s the project’s structure:
sentiment-api/
├── Dockerfile
├── requirements.txt
├── app.py
├── model.pkl
├── train_model.py
└── .dockerignore

Step 1: Create the API
Start by creating your project folder (I’m calling mine sentiment-api/ but name it whatever makes sense for your project). Inside that folder, create app.py. This is where things shift from research code to production code. We need proper error handling so one bad request doesn’t crash the entire service. We need logging so you can debug issues in production. And we need health checks so container orchestrators like Kubernetes know if your service is actually working.
app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import logging

# Configure logging so Docker can capture it properly
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load model once at startup, not on every request
logger.info("Loading model...")
model = joblib.load("model.pkl")
logger.info("Model loaded successfully!")

app = FastAPI()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    sentiment: str
    confidence: float

@app.get("/health")
def health_check():
    """Health endpoint for container orchestration"""
    return {"status": "healthy"}

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    """Make a prediction on the input text"""
    try:
        prediction = model.predict([request.text])[0]
        proba = model.predict_proba([request.text])[0]
        confidence = float(max(proba))
        logger.info(f"Prediction made with confidence: {confidence:.2f}")
        return PredictionResponse(
            sentiment=prediction,
            confidence=confidence
        )
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise

requirements.txt
fastapi==0.104.1
uvicorn==0.24.0
joblib==1.3.2
scikit-learn==1.3.2
pydantic==2.5.0

Step 1.2: Create and Train a Model
If you don’t already have a model to containerize, you can use the script below to train a small one and serialize it as a pickle file.
train_model.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import joblib
import numpy as np
sample_texts = [
"I love this product! It's amazing and works great.",
"This is the best thing I've ever bought. Highly recommend!",
"Excellent quality and fast shipping. Very satisfied.",
"Great service and fantastic product. Will buy again!",
"Outstanding experience from start to finish.",
"Really impressed with the quality and value.",
"Exceeded my expectations in every way!",
"Perfect! Exactly what I was looking for.",
"Terrible product. Complete waste of money.",
"Very disappointed. Does not work as advertised.",
"Poor quality and terrible customer service.",
"Horrible experience. Would not recommend to anyone.",
"Broke after one day of use. Very frustrating.",
"Not worth the price. Many better options available.",
"Completely useless. Returning immediately.",
"Worst purchase I've ever made. Stay away!",
"It's okay, nothing special but does the job.",
"Average product, neither good nor bad.",
"Decent quality for the price, but nothing extraordinary.",
"Works as expected, no complaints but not impressed either.",
]
sample_labels = [
"positive", "positive", "positive", "positive",
"positive", "positive", "positive", "positive",
"negative", "negative", "negative", "negative",
"negative", "negative", "negative", "negative",
"neutral", "neutral", "neutral", "neutral"
]
texts_extended = sample_texts * 25
labels_extended = sample_labels * 25
for variation in range(50):
    texts_extended.extend([
        f"Amazing product! Love it so much. Variation {variation}",
        f"Excellent quality and design. Worth every penny! Version {variation}",
        f"Terrible experience. Very disappointed. Case {variation}",
        f"Worst product ever. Do not buy! Instance {variation}",
        f"It's fine, nothing special. Variant {variation}",
        f"Okay product, average quality. Sample {variation}",
    ])
    labels_extended.extend(["positive", "positive", "negative", "negative", "neutral", "neutral"])
X_train, X_test, y_train, y_test = train_test_split(
texts_extended, labels_extended, test_size=0.2, random_state=42, stratify=labels_extended
)
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=500, ngram_range=(1, 2))),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
print("Training sentiment analysis model...")
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
joblib.dump(pipeline, 'model.pkl')
print("\nModel saved as 'model.pkl'")
test_sentences = [
"This is absolutely wonderful!",
"I hate this so much.",
"It's okay, I guess.",
]
print("\n" + "="*50)
print("Testing the model with new sentences:")
print("="*50)
for sentence in test_sentences:
    prediction = pipeline.predict([sentence])[0]
    proba = pipeline.predict_proba([sentence])[0]
    confidence = max(proba)
    print(f"\nText: '{sentence}'")
    print(f"Prediction: {prediction} (confidence: {confidence:.2%})")

Step 2: Create a Production-Grade Dockerfile
Here’s where we level up from Part 3. This Dockerfile uses multi-stage builds to create smaller, more secure images.
Dockerfile
# Stage 1: Builder
FROM python:3.11-slim as builder
WORKDIR /app
# Install dependencies in a separate stage
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Stage 2: Runtime
FROM python:3.11-slim
# Create non-root user for security
RUN useradd -m -u 1000 apiuser
WORKDIR /app
# Copy only the dependencies from builder
COPY --from=builder /root/.local /home/apiuser/.local
# Copy application code
COPY app.py model.pkl ./
# Make sure scripts are in PATH
ENV PATH=/home/apiuser/.local/bin:$PATH
# Switch to non-root user
USER apiuser
# Expose port
EXPOSE 8000
# Health check (orchestrators use this!)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Why Use This Approach?
With multi-stage builds, we can install dependencies during the builder stage, then copy only what we need to the runtime stage. Your final image ends up substantially smaller, which means faster deployments and lower storage costs.
Running as a non-root user is a security requirement at most companies. If someone exploits your container, they don’t get root access to potentially escape the container or access the host system.
The health check tells orchestrators whether your container is actually working or not. If it fails three times in a row, the orchestrator knows to restart your container automatically.
We’re using Python’s logging module instead of print statements because Docker captures logs properly this way. You’ll be able to search and filter them in CloudWatch, Datadog, or whichever logging system your company uses.
The --no-cache-dir flag in pip keeps the image smaller by not storing downloaded packages.
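If you want to see how much the multi-stage build saves, check the image size after building it in Step 4, and compare it against a single-stage variant if you tag one (the sentiment-api:single-stage tag here is just a hypothetical name for that comparison build):
# Show the final image and its size
docker images sentiment-api:v1.0.0
# Show how much each layer contributes
docker history sentiment-api:v1.0.0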
Step 3: Add a .dockerignore
This works like .gitignore, but for Docker. It prevents unnecessary files from bloating your image or accidentally including sensitive data.
.dockerignore
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg
*.egg-info/
dist/
build/
.git/
.gitignore
.vscode/
.idea/
*.ipynb
.ipynb_checkpoints/
README.md
tests/

Step 4: Build and Run
# Build the image
docker build -t sentiment-api:v1.0.0 .
# Run it
docker run -p 8324:8000 sentiment-api:v1.0.0
# Test it
curl -X POST http://localhost:8324/predict \
-H "Content-Type: application/json" \
-d '{"text": "This movie was amazing!"}'

If docker run complains that the port is already in use, check what’s holding it (or just map a different host port):
– Mac/Linux:
lsof -i :8324
– Windows:
netstat -ano | findstr :8324
After running the curl command, you should get back:
{
  "sentiment": "positive",
  "confidence": 0.8150972762233741
}

🎉 That’s it! You just containerized a production ML API. This same container will run identically on your laptop, in AWS, in GCP, on Azure, pretty much anywhere Docker runs. If it works locally, it’ll work in production (assuming you didn’t hardcode anything 😬🤞🏾).
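One more quick check before moving on: the /health endpoint is what the Dockerfile HEALTHCHECK (and any orchestrator) will call, so make sure it responds as well.
curl http://localhost:8324/health
# {"status": "healthy"}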

GPU Support for Deep Learning
If you’re training deep learning models, you need GPUs. Here’s how to actually use them in Docker without fighting with CUDA installations.
Unfortunately, the NVIDIA Container Toolkit is specific to Linux and does not work on macOS. If you are running Docker on a Mac, the containers will not have direct access to the GPU. If you need GPU-accelerated containers, you’ll need to run them on a Linux machine with an NVIDIA GPU (like a cloud instance or dedicated Linux box).
If you’re on Windows, there is a solution: WSL2 (Windows Subsystem for Linux). Your machine still needs an NVIDIA GPU with recent drivers installed, but Docker Desktop manages GPU passthrough to containers via WSL2 automatically.
Prerequisites
On your host machine:
- NVIDIA GPU with recent drivers
- NVIDIA Container Toolkit installed.
The code snippet below is a bash script for installing the NVIDIA Container Toolkit on Ubuntu. This allows Docker containers to access NVIDIA GPUs, and it’s needed for running GPU-accelerated applications in Docker.
You can also read the docs on installing the toolkit.
# Install on Ubuntu (check NVIDIA docs for other systems)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
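Once the toolkit is installed and Docker has restarted, a quick sanity check is to run nvidia-smi in a throwaway CUDA container (the exact nvidia/cuda tag doesn’t matter much, any recent one will do):
# If this prints your GPU table, Docker can see the GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi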
Dockerfile for GPU Training
We’ll start with NVIDIA’s official CUDA base images. Don’t try to install CUDA yourself, unless you want to waste hours debugging driver mismatches and other cryptic issues…ask me how I know 🙃.
Dockerfile.gpu
# Start with NVIDIA's CUDA base image
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y \
python3.11 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
# Install PyTorch with CUDA support
COPY requirements-gpu.txt .
RUN pip3 install --no-cache-dir --upgrade pip && \
pip3 install --no-cache-dir -r requirements-gpu.txt --extra-index-url https://download.pytorch.org/whl/cu121
# Copy training script
COPY train_model.py .
CMD ["python3", "train_model.py"]requirements-gpu.txt
torch==2.1.0+cu121
torchvision==0.16.0+cu121
transformers==4.35.0
accelerate==0.24.0

Notice we’re installing PyTorch with CUDA 12.1 support (+cu121). The version needs to match your base image’s CUDA version; otherwise, PyTorch won’t see your GPUs.
Running with GPU Access
# Build GPU image
docker build -f Dockerfile.gpu -t pytorch-trainer:gpu .
# Run with GPU access
docker run --gpus all \
-v $(pwd)/data:/workspace/data \
-v $(pwd)/models:/workspace/models \
pytorch-trainer:gpu

The --gpus all flag gives the container access to all your GPUs. You can also choose which GPUs to use if you’re sharing a multi-GPU machine.
# Use only GPU 0
docker run --gpus '"device=0"' pytorch-trainer:gpu
# Use GPUs 0 and 1
docker run --gpus '"device=0,1"' pytorch-trainer:gpuGPU in Docker Compose
docker-compose.gpu.yml
version: '3.8'
services:
  trainer:
    build:
      context: .
      dockerfile: Dockerfile.gpu
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
      - ./logs:/workspace/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # or specific count
              capabilities: [gpu]

Now we run it.
docker-compose -f docker-compose.gpu.yml up

You should always verify GPU access before starting a long training run.
docker run --gpus all pytorch-trainer:gpu python3 -c "import torch; print(torch.cuda.is_available())"

If it prints True, you’re good to go. If it prints False, you’ve got a configuration problem to debug before you burn hours on a training run that silently falls back to CPU.
Debugging Common Problems
Everyone has seen errors pop up before, and I’ve seen far too many. Here are some issues I’ve come across and how I’ve handled them 😤.
❗️ 1: Container Exits Immediately
Symptom: You run docker run my-app:v1.0.0 and it exits immediately with no obvious error.
How to Debug It
# Check the logs first
docker logs <container-id>
# Or run with --rm to see output immediately
docker run --rm my-app:v1.0.0

Common Causes and Fixes
1. The application crashed on startup.
This is usually a Python import error or a missing file. The logs will tell you exactly what failed.
Common Culprits
- Import fails because a dependency isn’t installed
- File path is wrong (remember, paths are relative to WORKDIR)
- An environment variable is missing
2. No long-running process
Your container needs a process that stays running. If your CMD finishes quickly, the container exits.
# dockerfile
# Bad - exits immediately after training finishes
CMD ["python", "train.py"]
# Good - stays running
CMD ["uvicorn", "app:app", "--host", "0.0.0.0"]If you need to run a one-off training job, that’s fine, the container should exit when it’s done. Just make sure you’re mounting volumes to save your outputs.
3. Wrong working directory
# dockerfile
WORKDIR /app
COPY app.py .
# Make sure CMD can actually find app.py at /app/app.py

❗️ 2: Import Errors Inside the Container
Symptom: ModuleNotFoundError: No module named 'some_package' even though the package is clearly in the requirements.txt file 🙄.
How to Debug It
# Get a shell in the running container
docker exec -it <container-id> bash
# Or if container keeps exiting, override the entrypoint
docker run -it --entrypoint bash my-app:v1.0.0
# Inside container, try importing
python -c "import problematic_module"Common Causes and Fixes
1. Wrong base image
If you’re using python:3.11-slim, you’re missing a lot of system libraries. Some Python packages need to compile C extensions during install and will fail if the required build tools and headers aren’t there.
# dockerfile
# If you need packages that compile C extensions
# Full image, not slim
FROM python:3.11
# Or install what you need
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
gcc \
python3-dev \
&& rm -rf /var/lib/apt/lists/*2. Requirements might not be installed
Check your layer order. Docker executes commands sequentially.
# dockerfile
# Good - requirements installed before copying code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# Bad - requirements copied but never installed
COPY . .
# You forgot RUN pip install!

3. Virtual environment confusion
Don’t use venv inside containers. Your container is already isolated from the host system; that’s the whole point of Docker.
# dockerfile
# Don't do this
RUN python -m venv venv
RUN source venv/bin/activate # This won't persist to the next layer!
# Just install directly
RUN pip install -r requirements.txt

❗️ 3: Can’t Connect to Services
Symptom: Your Jupyter container can’t reach PostgreSQL even though both containers are running.
How to Debug It
# Check if containers are on the same network
docker network ls
docker inspect <container-id> | grep NetworkMode
# Check if the service is actually running
docker-compose logs postgres
docker-compose logs jupyter

Common Mistakes and Fixes
😈 Using localhost instead of service name
This is the most common networking mistake. Inside a container, localhost refers to that container, not your host machine.
# Wrong - localhost is the container itself
engine = create_engine('postgresql://user:pass@localhost:5432/db')
# Right - use the service name from docker-compose.yml
engine = create_engine('postgresql://user:pass@postgres:5432/db')

Docker Compose creates a network where services can find each other by name. If your service is called postgres in the Compose file, that’s its hostname.
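If you want to confirm name resolution from inside a container, here’s a quick sketch (the service names jupyter and postgres are assumed from your Compose file):
# Should print the postgres container's IP on the Compose network
docker compose exec jupyter python -c "import socket; print(socket.gethostbyname('postgres'))"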
😈 Services not on the same network
# Both services must be in the same compose file
# OR explicitly connected to the same network
services:
  app:
    networks:
      - my-network
  db:
    networks:
      - my-network
networks:
  my-network:

😈 Service not ready yet
Remember from Part 3? The depends_on option doesn’t wait for the service to be ready; it waits for the container to start.
# Add retry logic in your application code
import time
from sqlalchemy import create_engine
for attempt in range(5):
    try:
        engine = create_engine(DATABASE_URL)
        engine.connect()
        break
    except Exception as e:
        if attempt < 4:
            print(f"Database not ready, retrying... ({attempt + 1}/5)")
            time.sleep(2)
        else:
            raise

❗️ 4: Running Out of Disk Space
Symptom: docker build fails with “no space left on device” even though you swear you have space.
Docker images pile up fast, especially if you’re iterating on builds. Each failed build leaves behind layers.
Check what’s using space
# Overview
docker system df
# Detailed breakdown
docker system df -v

Clean up using these commands
# Remove stopped containers
docker container prune
# Remove unused images
docker image prune
# Remove unused volumes (be careful with this one)
docker volume prune
# Remove everything not currently in use
docker system prune -a

Keep your images lean in your Dockerfile
# Clean up in the same layer to reduce image size
RUN apt-get update && \
apt-get install -y package && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Use --no-cache-dir with pip
RUN pip install --no-cache-dir -r requirements.txt

Each RUN command creates a new layer. If you install something in one layer and delete it in another, both layers still exist in the final image. Clean up in the same RUN command.
I have a cron job that runs docker system prune -f every Sunday night. Keeps things manageable without me thinking about it.
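If you want the same setup, the crontab entry is a one-liner (the schedule below is Sundays at 11 PM, adjust to taste):
# crontab -e
0 23 * * 0 docker system prune -f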
❗️ 5: Slow Build Times
Symptom: docker build takes forever, especially the pip install step. You change one line of code and have to wait 5 minutes for a rebuild.
Fix it with Better Layer Ordering
Docker caches layers. If a layer hasn’t changed, Docker reuses the cached version. The trick is putting things that change rarely at the top and things that change frequently at the bottom.
# dockerfile
# Good - requirements rarely change, so this layer stays cached
COPY requirements.txt .
RUN pip install -r requirements.txt
# Code changes don't invalidate the pip cache
COPY . .
# Bad - every code change reinstalls everything
COPY . .
RUN pip install -r requirements.txt

Use BuildKit
BuildKit, Docker’s newer build engine, is significantly faster. It’s the default in recent Docker versions, but if your installation isn’t using it yet, enable it once and forget about it.
# One-time for a build
DOCKER_BUILDKIT=1 docker build -t my-app:v1.0.0 .
# Make it default (add to ~/.bashrc or ~/.zshrc)
export DOCKER_BUILDKIT=1

Cache pip Downloads During Build
This is more advanced, but if you’re rebuilding frequently, try the command below.
# dockerfile
# Mount pip cache during build
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt

Use a More Specific Base Image
Instead of starting from scratch every time, try this.
# dockerfile
# Instead of generic Python
FROM python:3.11-slim
# Use an image with common ML packages pre-installed
FROM continuumio/miniconda3:25.3.1-1

Production Readiness Checklist 📋

Before deploying to production, walk through this checklist to save yourself future hassle.
Security
- Running as a non-root user
- No secrets hardcoded in Dockerfile or code (use environment variables or secrets management)
- Using specific version tags, not latest
- Scanned image for vulnerabilities (use tools like Trivy or Snyk; see the scan example below)
- Using a minimal base image (slim or alpine variants when possible)
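For the vulnerability scan item, a minimal example with Trivy (assuming you have the trivy CLI installed) against the image we built earlier:
# Report high and critical findings for the image
trivy image --severity HIGH,CRITICAL sentiment-api:v1.0.0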
Reliability
- Health check configured properly
- Proper logging with appropriate levels (not just print statements)
- Graceful shutdown handling for long-running processes
- Set resource limits (prevents one container from eating all CPU/memory; see the flags example after this list)
- Have retry logic for external dependencies (databases, APIs)
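Here’s what resource limits look like on a plain docker run; the numbers are arbitrary examples, size them for your workload:
# Cap the container at 2 CPUs and 4 GB of memory
docker run --cpus=2 --memory=4g -p 8324:8000 sentiment-api:v1.0.0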
Performance
- Multi-stage build reducing final image size
- Layers ordered for maximum cache efficiency
- No unnecessary files in image (check .dockerignore)
- Model loaded once at startup, not per request
- Connection pooling for database connections
Observability
- Application metrics exposed (e.g., response times, error rates)
- Logs structured in JSON format for easy parsing
- Error tracking configured (use Sentry, Rollbar, etc.)
- Health endpoint responds in under 1 second
- Version and build info included in logs
Congratulations! You just:
- Built production-ready ML APIs with proper error handling and logging
- Used multi-stage builds to create smaller, more secure images
- Configured GPU access for deep learning training
- Learned debugging patterns for common Docker issues
Homework 📚

Exercise 1: Build Your Own API
Take one of your ML models (it doesn’t have to be a complex model) and containerize it with FastAPI. Add health checks, proper logging, and run it as a non-root user. Make sure to test error handling by intentionally sending bad inputs.
Exercise 2: Break Things on Purpose
Create a Dockerfile with wrong dependencies. Use localhost instead of service names in a Compose file. Practice debugging with docker logs and docker exec. The goal is to see these errors in a safe environment so you recognize them instantly when they inevitably happen in production.
Exercise 3: Optimize for Speed
Take an existing Dockerfile and measure the build time. Reorder the layers for better caching. Add multi-stage builds. Enable BuildKit. Measure again. You should be able to cut build time by at least 50%.
In Part 5, we’re covering production deployment. Container registries (Docker Hub, ECR, GHCR), security scanning, orchestration platforms (Kubernetes, ECS), and CI/CD pipelines. This is the final piece you need to understand how to confidently deploy containers to production.
Want the quick command reference? Grab the Docker for Data Professionals Cheat Sheet to keep these commands handy!
Remember: Four parts ago, you were learning what containers even are. Now you’re building production-grade ML APIs with multi-stage builds, configuring GPU support, and debugging issues systematically. That’s significant progress. Happy coding 👍🏾.
