His Docker Containers Crashed at 3am. The Fix Took Five Minutes.

Tom runs a SaaS in Manchester with 12 containers in production. One Tuesday night his API, frontend, and authentication service went down, and his phone woke him just after 3am. Root cause: nobody set memory limits. One container consumed 28GB of the server's 32GB of RAM, and the OOM killer terminated other services to free memory. Five minutes of configuration would've prevented it.

3am: when Tom's SaaS went down

5 min: time to fix, if he'd known

90%: of outages are config errors

Common mistakes we check for

Tom runs a SaaS platform in Manchester with 2,400 paying customers and 12 containers in production. At 9pm on a Tuesday in March, a routine database migration triggered a query that consumed 28GB of RAM, nearly all of his 32GB server. The Linux OOM killer activated. It didn't kill the database. It killed the API server, the frontend, and the authentication service: three processes that had nothing to do with the runaway query but that the kernel picked as victims.

Tom woke at 3:07am to his phone buzzing. Customer support tickets were flooding in. His dashboard showed everything as "operational" because Docker reported all containers as running — including the three that were brain-dead but technically hadn't exited. By the time he diagnosed the issue at 4:15am, he'd lost 5 hours of uptime, received 47 support tickets, and fielded calls from two enterprise clients threatening to cancel their annual contracts.

Root cause: not a single container had a memory limit set. One query. One container. No guardrails. Total cost: an estimated £4,200 in refunds and lost goodwill, plus a weekend of his life debugging. The fix is one setting per container in the compose file: deploy: resources: limits: memory: 512M. Five minutes of configuration would have prevented the entire incident.

After managing hundreds of production containers, we keep seeing the same mistakes. In 2026, with AI agents generating Dockerfiles faster than ever, these mistakes aren't disappearing. They're multiplying.¹

"I've seen Docker :latest break production more times than I can count. With AI agents generating Dockerfiles faster than ever, these mistakes are multiplying." — Stackademic, May 2026

The 5 Mistakes That Cost Real Money

1. No Memory Limits — The Silent Server Killer

This is the mistake that took down Tom's SaaS. His 32GB server ran 12 containers, none with a memory limit, so when a migration query grew to 28GB of RAM nothing stopped it. The Linux OOM (Out of Memory) killer chooses victims by a kernel score, not by blame. It left the database container that caused the problem and killed the API server, the frontend, and the authentication service. The database survived. Everything else died.²

The fix is one setting per service in your compose file: deploy: resources: limits: memory: 512M. Set it on every container. With a limit in place, a container that runs away is killed on its own instead of taking its neighbours down. Start conservative; you can always increase it. But never run production containers without it. When we audited our own infrastructure before publishing this post, we found 10 containers with no memory limits. We fixed 4 critical ones immediately. The remaining 6 are scheduled for our next maintenance window. If our setup had gaps, yours almost certainly does.³
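
Written out as nested YAML, the memory limit might look like the sketch below. The service name `api` and the image `example/api:1.4.2` are hypothetical placeholders, and the reservation value is an illustrative assumption, not a recommendation. This assumes Docker Compose v2, which applies `deploy.resources` limits to ordinary containers; the legacy docker-compose v1 binary only honours them when run with `--compatibility`.

```yaml
# docker-compose.yml (fragment)
services:
  api:                           # hypothetical service name
    image: example/api:1.4.2     # hypothetical image
    deploy:
      resources:
        limits:
          memory: 512M           # hard cap: only this container is OOM-killed if it exceeds it
        reservations:
          memory: 256M           # soft reservation (assumed value for illustration)
```

Running `docker stats --no-stream` afterwards shows each container's usage against its limit, which helps when deciding whether 512M is too tight or too generous for a given service.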

2. The :latest Tag — Production Roulette

The :latest tag is a moving target. Every time you pull the image, you get whatever the maintainer published most recently, including breaking changes, deprecated APIs, and security patches that change behaviour. A minor version bump, say from nginx 1.25 to 1.26, can change defaults such as cipher suites and break TLS connections. A major version bump can leave your data directory unusable because the new image expects a different volume layout.¹

The fix is simple but tedious: pin every image to a specific digest. Not a version tag, because a tag can be re-pushed to point at different content, but a digest: image: nginx@sha256:abc123def456.... This guarantees you're running exactly the code you tested. When you want to upgrade, you test the new digest in staging first, then update production. Our own infrastructure audit found 17 containers running :latest, including critical services like Redis, Evolution API, and Firecrawl. Every one of those is a potential 3am outage waiting to happen. We're fixing them. You should too.²
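
A minimal sketch of a pinned image in a compose file follows. The service name `web` is hypothetical, and `<digest>` is a placeholder you must replace with the full 64-character digest of the image you actually tested; the fragment will not pull until you do.

```yaml
# docker-compose.yml (fragment)
services:
  web:                                  # hypothetical service name
    # Keeping the tag is optional: it documents the intended version for humans,
    # while the digest after "@" is what Docker actually resolves.
    # To look up the digest of an image you have already pulled and tested:
    #   docker inspect --format '{{index .RepoDigests 0}}' nginx:1.25
    image: nginx:1.25@sha256:<digest>   # placeholder, not a real digest
```

Upgrading then becomes an explicit, reviewable change: a new digest appears in version control only after it has passed staging.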

3. Running as Root — The One-Line Security Disaster

Most Dockerfiles never specify a user, and Docker's default is root. If an attacker compromises your application, whether through a dependency vulnerability or a misconfigured endpoint, they are root inside the container. That is not automatically root on the host, but it removes the main barrier: root in a container can modify filesystem permissions, install tools, plant persistence in mounted volumes that survives container restarts, and use a kernel flaw or an over-permissive mount (such as the Docker socket) to break out to the host. Environment variables containing API keys and database passwords are readable by any process in the container, root or not.³

The fix is one line near the end of your Dockerfile: USER 1000. Create a non-root user, make sure it owns the files your application needs to write, switch to it, and run your application with minimal privileges. It doesn't prevent all exploits, but it makes the worst outcome much harder: an attacker getting root on your host through a compromised container. Our audit found 9 containers running as root. Fixing them requires Dockerfile changes (not just runtime config), so they're on the roadmap. But every new container we deploy from now on will use a non-root user by default.
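
For images you cannot rebuild straight away, Compose offers a runtime counterpart to the Dockerfile's USER line. The sketch below is an assumption-laden stopgap rather than a replacement for the Dockerfile change: it only works when the files the application reads and writes are accessible to UID 1000, and the service name and image are hypothetical.

```yaml
# docker-compose.yml (fragment)
services:
  api:                          # hypothetical service name
    image: example/api:1.4.2    # hypothetical image
    # Overrides the image's default user at runtime (UID:GID).
    # If the app fails with permission errors, the image needs the
    # Dockerfile fix described above (create the user, chown its files).
    user: "1000:1000"
```

`docker exec <container> id` is a quick way to confirm which user the process actually runs as.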

4. No Health Checks — Routing Traffic to the Dead

Docker considers a container "running" as long as its main process hasn't exited. It doesn't know if your application is actually responding to requests. Without a health check, your load balancer (Traefik, nginx, HAProxy) keeps sending traffic to a container that's alive but brain-dead. Users get timeouts and 502 errors while Docker reports the container as running.¹

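Add a health check to every container that serves traffic: HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD curl -f http://localhost/health || exit 1 (curl must be installed in the image). Docker then marks the container as healthy or unhealthy. Proxies that read that status, such as Traefik's Docker provider, stop routing to unhealthy containers, and an orchestrator such as Docker Swarm replaces them. Standalone Docker does not restart an unhealthy container on its own, so pair the check with something that acts on it. For databases, check connectivity. For workers, check queue depth. For APIs, check the /health endpoint. Without this, your load balancer is flying blind.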
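
The same check can be declared in the compose file instead of the Dockerfile. This is a sketch under a few assumptions: the service name and image are hypothetical, the application exposes /health on port 80 inside the container, curl exists in the image, and the `start_period` value is an illustrative addition not taken from the text above.

```yaml
# docker-compose.yml (fragment)
services:
  api:                          # hypothetical service name
    image: example/api:1.4.2    # hypothetical image
    healthcheck:
      # CMD-SHELL runs the string through the container's shell
      test: ["CMD-SHELL", "curl -f http://localhost/health || exit 1"]
      interval: 30s             # how often to probe
      timeout: 5s               # a probe slower than this counts as a failure
      retries: 3                # consecutive failures before "unhealthy"
      start_period: 20s         # grace period while the app boots (assumed value)
```

Once this is in place, `docker ps` shows `(healthy)` or `(unhealthy)` next to the container's status, which is the signal a proxy or orchestrator can act on.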

5. Logging to stdout Without Rotation — The Disk Filler

With the default json-file logging driver, Docker captures all stdout and stderr output from your containers and keeps it with no size limit. A busy web server can generate gigabytes of logs per day. A container that crashes and restarts in a loop can fill a disk in hours. When the disk fills, everything stops: database writes fail, API requests are rejected, file uploads break. The server doesn't crash gracefully. It suffocates.³

Configure log rotation in your compose file: logging: driver: json-file options: max-size: 10m max-file: 3. This keeps the last 30MB of logs per container — enough for debugging, not enough to fill a disk. Better yet: ship logs to an external service (Loki, CloudWatch, Datadog) and set Docker to discard local logs entirely. Never let a production server run without log rotation.
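
Laid out as nested YAML, the rotation settings above might look like this. The service name and image are hypothetical; the option values are quoted because Compose expects logging options as strings.

```yaml
# docker-compose.yml (fragment)
services:
  api:                          # hypothetical service name
    image: example/api:1.4.2    # hypothetical image
    logging:
      driver: json-file
      options:
        max-size: "10m"         # rotate once the current log file reaches 10MB
        max-file: "3"           # keep at most 3 files, so roughly 30MB per container
```

Logging options are fixed when a container is created, so existing containers need to be recreated (for example with `docker compose up -d --force-recreate`) before the new limits apply.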

🛡️ We'll audit your Docker setup against all 8 checklist items. Takes 5 minutes.

Free Docker Audit →

"My Developer Handles This" / "We Use Managed Hosting"

Tom's SaaS was on a managed hosting platform. He assumed they handled infrastructure. They didn't. Managed hosting manages the hardware — not your Docker configuration. Memory limits, health checks, non-root users, and log rotation are your responsibility. Your developer probably set up the containers to work, not to be production-safe. Working and safe are different things.

We audited our own infrastructure before publishing this post and found 17 containers running :latest tags, 10 without memory limits, and 9 running as root. We're eating our own dog food, and we found gaps. If our setup has these issues, yours almost certainly does too.

The Full Checklist

Check	Why
Pinned digests (not :latest)	Prevent surprise breakages
Memory limits set	Prevent OOM cascades
Non-root user	Prevent privilege escalation
Health checks configured	Prevent routing to dead containers
Log rotation enabled	Prevent disk exhaustion
Read-only root filesystem	Prevent runtime tampering
No secrets in ENV	Prevent secret leakage via docker inspect
Restart policy: unless-stopped	Survive host reboots
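
The last three rows of the checklist are not covered by the five mistakes above, so here is a hedged sketch of how they could be expressed in a compose file. The service name, image, secret name, and file path are all hypothetical, and the tmpfs mount assumes the application only needs to write to /tmp.

```yaml
# docker-compose.yml (fragment)
services:
  api:                          # hypothetical service name
    image: example/api:1.4.2    # hypothetical image
    read_only: true             # root filesystem is mounted read-only
    tmpfs:
      - /tmp                    # writable scratch space kept in memory (assumed path)
    restart: unless-stopped     # comes back after a host reboot unless stopped by hand
    secrets:
      - db_password             # available in the container at /run/secrets/db_password

secrets:
  db_password:                  # hypothetical secret name
    file: ./secrets/db_password.txt   # hypothetical path; keep it out of version control
```

Reading the password from /run/secrets instead of an environment variable keeps it out of `docker inspect` output, at the cost of the application needing to read a file.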

What We Learned Auditing Ourselves

Before publishing this post, we ran the full 8-point checklist against our own 57 production containers. We found 17 :latest tags (zero pinned digests), 10 containers with no memory limits, and 9 running as root. We fixed 4 critical memory limits immediately: orchestrator-api (1GB), sovael-ai (256MB), litellm (512MB), and orchestrator-worker (512MB). The remaining 6 are scheduled for our next maintenance window.

The lesson: infrastructure debt is invisible until you look. We thought our setup was solid. The audit took 30 seconds per check. We'd never run it before. If our infrastructure had these gaps, yours almost certainly does. The full audit — with actual data and copy-paste commands — is published at our Docker Infrastructure Audit post.

Sovael Infrastructure Audit: £297

What You'd Pay Elsewhere	Cost	Sovael
DevOps contractor (half day)	£400-800	Included
Security consultant review	£500-1,500	Included
Total	£900-2,300	£297

And once your infrastructure is solid, it connects to the broader Sovael platform. The same Docker health checks that prevent 3am crashes feed into our monitoring dashboard. The same authentication patterns that secure your containers secure your customer data. Infrastructure isn't separate from your business — it's the foundation everything else runs on. One audit. One report. One less thing that can wake you up at 3am.

🛡️ Five minutes. Eight checks. One report. Free.

Get Audited →

Sources

1. Stackademic, "Docker in Production Is Still Broken in 2026" (May 2026)
2. Bioquro, "Docker Best Practices for Production 2026" (May 2026)
3. Cloudzy, "Top Docker Security Mistakes 2026" (Apr 2026)