Owned in Production: How AI Can Be Your Best Friend or Your Worst Enemy

Posted on Apr 3, 2026

Last week I introduced ClickNBack and ended with a tease: the system is live, exposed to the real internet, and I learned what that means the hard way. Here’s that story.

What happened next is embarrassing in the way that’s worth sharing. I was so eager to ship the system and keep evolving it that I didn’t check the metrics dashboard for several days. I was heads-down on features, on iterating, on the momentum of having something live. Monitoring felt like overhead when there was code to write. Looking back, that’s a decision I won’t repeat.

[Image: an engineer notices a spike in CPU metrics on a monitor and realizes their production system has been compromised.]

Three Misconfigurations That Don’t Announce Themselves

When I finally headed to the dashboard, there it was. A familiar, unsettling pattern: sustained 100% CPU and 90% memory utilization, recurring after every reboot. What I found had been running continuously for approximately eight days.

ClickNBack runs on a single VPS with the full Docker Compose stack. When I deployed it, I was in “ship it” mode—focused on getting the system live and reachable. After all, it was just a demo system, right? Three independent misconfigurations combined to offer zero resistance against an automated attack. None of them advertised themselves.

The database port was published to all network interfaces, including the public IP. There’s no reason for a database to be internet-facing: the application connects through the internal Docker network. But Docker’s port publishing bypasses UFW entirely, and this is a well-documented gotcha. Docker writes its own iptables rules for published ports, and traffic to those ports is handled by Docker’s rules rather than by the ones UFW manages, so firewall rules can be perfectly configured and still be irrelevant for Docker-published ports. The only safe default is to never bind a sensitive port to 0.0.0.0.
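A check like this is easy to automate. The following is a minimal sketch, not the tooling used on ClickNBack: it assumes port mappings are written in Compose short syntax (for example "5432:5432" or "127.0.0.1:5432:5432"), that the Docker daemon's default bind address has not been changed, and the function name is made up for illustration.

```python
def published_on_all_interfaces(mapping: str) -> bool:
    """Return True if a short-syntax Compose port mapping binds to every host interface."""
    spec = mapping.split("/", 1)[0]  # drop a trailing "/tcp" or "/udp"
    if spec.startswith("["):
        # Bracketed IPv6 host address, e.g. "[::1]:5432:5432"
        host_ip = spec[1:spec.index("]")]
    else:
        parts = spec.split(":")
        # "5432" and "5432:5432" carry no host IP; "127.0.0.1:5432:5432" does
        host_ip = parts[0] if len(parts) == 3 else ""
    # No host IP means Docker falls back to its default, which is all interfaces
    return host_ip in ("", "0.0.0.0", "::")
```

With this, "5432:5432" is flagged while "127.0.0.1:5432:5432" passes. For a database that only the application talks to, dropping the port mapping entirely is the stricter option.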

The second issue: default credentials, straight from the example config file. I had even included a comment warning against using them in production. The excitement of having it in production quickly made me overlook my own advice, LOL. Automated bots tried the most common credential pairs within hours of the container going live, and found a match on the first night.
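A comment can't stop a copy-paste, but a small script can. Here is a rough sketch of one, assuming the example and live configs are both simple KEY=VALUE env files (the post does not say which format ClickNBack uses); the marker list and function names are hypothetical.

```python
SECRET_MARKERS = ("PASSWORD", "SECRET", "KEY", "TOKEN")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def unrotated_secrets(example_text: str, live_text: str) -> list[str]:
    """Return secret-looking keys whose live value still equals the example value."""
    example = parse_env(example_text)
    live = parse_env(live_text)
    return sorted(
        key
        for key, value in live.items()
        if any(marker in key.upper() for marker in SECRET_MARKERS)
        and example.get(key) == value
    )
```

Any non-empty result means a value from the example file made it into the live environment unchanged, which is exactly the condition that should block a deploy.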

Third: the database user held full superuser privileges—inherited from how the official PostgreSQL Docker image initializes accounts by default. Superuser access in PostgreSQL enables COPY TO PROGRAM, a legitimate built-in feature that executes arbitrary shell commands from within a database session. With a brute-forced login and superuser access, that’s remote code execution via SQL. No CVE required.
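The defensive counterpart is to keep the application off any superuser login. The sketch below only generates SQL text; it is one possible shape for a least-privilege setup, not the statements actually run on ClickNBack. The role, database, and schema names are placeholders, and the password is assumed to be set separately.

```python
import re

_IDENT = re.compile(r"[a-z_][a-z0-9_]*")

# Lists every role that currently holds superuser, via the pg_roles catalog view.
AUDIT_SUPERUSERS = "SELECT rolname FROM pg_roles WHERE rolsuper;"


def least_privilege_sql(role: str, database: str, schema: str = "public") -> list[str]:
    """Build statements for an application role that cannot run COPY ... PROGRAM."""
    for name in (role, database, schema):
        # Only plain lowercase identifiers are accepted, so no quoting is needed
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        f"CREATE ROLE {role} LOGIN NOSUPERUSER NOCREATEDB NOCREATEROLE;",
        f"GRANT CONNECT ON DATABASE {database} TO {role};",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {schema} TO {role};",
        f"GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {role};",
    ]
```

Note that GRANT ... ON ALL TABLES only covers tables that exist when it runs, so tables created later need their own grants. Running the audit query afterward should show the application role absent from the superuser list.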

The attacker was a well-known automated botnet. It didn’t target me specifically—it scanned for the exposed port, found it, tried common credentials, authenticated on the first attempt, and installed a Monero miner that consumed 92% of the CPU for eight consecutive days.

Thirty Minutes from Detection to Fixed

Once I started investigating, the diagnosis was immediate. A look at running processes surfaced the culprit. Tracing backward through the process tree confirmed the parent: the database container. A few more checks confirmed all three root causes.
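The "find the hot process, then walk up to its parent" step can be sketched in a few lines. This is an illustration under assumptions, not the commands from the incident: it expects the text output of `ps -eo pid,ppid,pcpu,comm`, and the helper names are invented for the example.

```python
def parse_ps(text: str) -> dict[int, tuple[int, float, str]]:
    """Parse `ps -eo pid,ppid,pcpu,comm` output into {pid: (ppid, cpu, command)}."""
    table = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, cpu, comm = line.split(None, 3)
        table[int(pid)] = (int(ppid), float(cpu), comm.strip())
    return table


def hottest(table: dict[int, tuple[int, float, str]]) -> int:
    """Return the PID with the highest CPU percentage."""
    return max(table, key=lambda pid: table[pid][1])


def ancestry(pid: int, table: dict[int, tuple[int, float, str]]) -> list[tuple[int, str]]:
    """Walk from a process up through its parents, returning (pid, command) pairs."""
    chain = []
    seen = set()
    while pid in table and pid not in seen:
        seen.add(pid)
        ppid, _, comm = table[pid]
        chain.append((pid, comm))
        pid = ppid
    return chain
```

Calling `ancestry(hottest(table), table)` gives the chain from the CPU hog back toward PID 1. If a database process shows up in that chain, the miner was spawned from inside the database container.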

What I deliberately did not do was treat a reboot as the fix. Rebooting a compromised system without fixing the underlying vulnerabilities isn’t remediation, it’s delay. Looking at the metrics history, I could see the pattern: the miner had survived multiple reboots, each time re-establishing itself in less than an hour of the system coming back up. Out of curiosity, I even tested this during the investigation by rebooting once more, and there it was again, right on schedule. The botnet wasn’t waiting around; it was simply re-running the same exploit every time the port came back up. Relying on repeated reboots would have been even more embarrassing than this whole situation already was.

The fix had to be layered and sequenced: remove the exposed port first, rotate all credentials that had lived in the compromised environment, revoke the superuser privilege from the application database account, then restart the stack clean. Thirty focused minutes from first suspicion to a clean bill of health.

The Post-Mortem Is What Actually Matters

The technical remediation was the straightforward part, especially with AI assistance. What mattered more was writing the incident report.

A proper post-mortem isn’t a confession or a performance of accountability. It’s a structured analysis: timeline, root causes, contributing factors, remediation actions, and—most importantly—the systemic changes that prevent recurrence. The point isn’t to document what went wrong; it’s to identify the conditions that allowed it and close them permanently.

In this case, the real root cause wasn’t any of the three technical misconfigurations. It was the absence of a security hardening phase in the deployment runbook. No checklist item for firewall rules. No constraint on port bindings. No mandatory step to rotate example credentials before go-live. The config file had a warning embedded in it—but a comment in a file is not a process gate.

Expertise First, Then AI

Here’s the part that’s a little embarrassing, and worth sharing precisely because of that. That runbook was drafted with AI assistance. Infrastructure and DevOps aren’t my strongest domains, and I leaned on AI to cover that gap. The problem with using AI in an area where you lack expertise is that you don’t always know what you should be asking. I got a solid-looking runbook that covered everything I thought to ask about, and it proved to work quite nicely. But the security hardening phase I didn’t ask about didn’t make the cut. The irony writes itself.

Here’s the flip side, though: that same AI helped me resolve the incident in thirty minutes. That’s not a contradiction; it’s the lesson. When I was diagnosing the incident, I was driving, focused on the task, not rushing things. I knew what symptoms to look for, which hypotheses made sense, what a clean result looked like. Working from that foundation, AI accelerated every step. Drafting the runbook from scratch, with no equivalent foundation, I got answers to the questions I asked, and silence on everything I didn’t know or remember to ask.

AI amplifies your expertise, but it can’t substitute for it. Point it at a domain you understand and it makes you faster. Point it at one where you have blind spots, and it makes those blind spots invisible. The senior skill isn’t using AI; it’s knowing when your judgment is strong enough to trust the output, and how to surface your blind spots when it isn’t.

The runbook now has a mandatory security phase that must complete before DNS points at the server. Hard constraints, not suggestions. That’s the output that actually matters: not a chronicle of how things broke, but a changed process that closes the gap permanently.
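To make "hard constraints, not suggestions" concrete, here is a minimal sketch of what a process gate could look like as code. It is an illustration only, since the actual runbook format isn't shown here; the function names are hypothetical, and each check would be a real inspection of the server rather than the placeholders in the usage note.

```python
from typing import Callable

Check = Callable[[], bool]


def run_gate(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of those that did not pass."""
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a check that crashes counts as a failure, not a pass
        if not passed:
            failures.append(name)
    return failures


def gate_exit_code(checks: dict[str, Check]) -> int:
    """Return 0 only when every check passes, so a deploy script can stop on non-zero."""
    failed = run_gate(checks)
    for name in failed:
        print(f"BLOCKED: {name}")
    return 1 if failed else 0
```

A deploy script could call `sys.exit(gate_exit_code({...}))` with entries such as "database port not published publicly" or "example credentials rotated" before the DNS step. The design choice that matters is that a missing or broken check blocks the release instead of silently passing.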

Anyway, production systems have incidents. The question isn’t whether you’ll encounter one—it’s whether how you respond makes the system stronger. This one did.