How I Build Systems That Run While I Sleep

By Lazar Milicevic · June 24, 2026 · 10 min read

The first time one of my automation pipelines woke me up at 3 a.m. with a Slack alert, I learned something important: the alert itself was the success. The system had caught its own failure, rolled back cleanly, and was waiting for me. It hadn't corrupted data. It hadn't double-published. It hadn't sent a half-finished article to production. It had just stopped and told me why.

That's the bar I hold every unattended system to now. Autonomous doesn't mean unsupervised. It means the system knows how to fail safely without me in the loop.

This post is the pattern I actually use, drawn from building my BizFlowAI ContentStudio - an autonomous content and SEO machine that researches, writes, optimizes, and publishes across multiple sites without me touching it.

The four properties every unattended system needs

If a system is going to run while I sleep, it needs four properties, in this order: scheduling, idempotency, observability, and graceful failure. Skip any one and you'll wake up to a mess.

Most engineers reach for the exciting parts first - the LLM call, the agent loop, the clever prompt. I do the opposite. I build the boring scaffolding first, prove it works with a dummy job, then plug in the interesting logic. The boring parts are what let me actually leave it alone.

Here's how each property shows up in practice.

Property	Question it answers	What it looks like
Scheduling	When does work happen?	EventBridge cron, queue depth triggers, debounced events
Idempotency	What if this runs twice?	Deterministic keys, upserts, "already done" checks
Observability	What happened and why?	Structured logs, run records in Postgres, alerts on anomalies
Failure handling	What happens when it breaks?	Retries with backoff, dead-letter queues, circuit breakers

The rest of this post is what each of these means when you're actually shipping.

Scheduling: pick the trigger that matches the work

The default instinct is cron. Cron is fine for the simplest cases, but most real pipelines have at least three different kinds of work, and each wants a different trigger.

In my content machine, there are three:

Research and topic discovery runs on a fixed schedule - once a day at 04:00 UTC. It's a steady, predictable workload. EventBridge cron is perfect.
Article generation is event-driven. When a new validated topic lands in the queue, a Lambda picks it up. There's no point running on a clock; the work either exists or it doesn't.
Publishing and indexing pings are debounced. When an article is ready, I wait 90 seconds before publishing to allow last-second edits or quality-gate overrides. Then I fire.

The mistake I see most often is using cron for everything. You end up with a 5-minute polling job that does nothing 95% of the time and floods your logs. Worse, when the work is there, you wait up to 5 minutes to start. Event-driven triggers are almost always cheaper and faster.

One concrete rule: if your scheduled job finds zero work to do on more than half of its runs, it should be event-driven.
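A minimal sketch of that rule, assuming you already record how many items each scheduled run processed. The function name and the `work_counts` input are hypothetical, not part of any library:

```python
from typing import Sequence


def should_be_event_driven(work_counts: Sequence[int], threshold: float = 0.5) -> bool:
    """Return True when more than `threshold` of the scheduled runs found no work.

    `work_counts` holds one entry per past run: the number of items it processed.
    """
    if not work_counts:
        return False  # no history yet, so no evidence either way
    empty_runs = sum(1 for count in work_counts if count == 0)
    return empty_runs / len(work_counts) > threshold


# A 5-minute poller that found work on only one of its last four runs.
print(should_be_event_driven([0, 0, 0, 1]))
```

Exactly half empty runs does not trip the rule; it only fires when empty runs are the majority.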

Idempotency: the property that lets you retry without fear

This is the one that separates production systems from prototypes. Idempotency means running the same operation twice produces the same result as running it once.

If your pipeline isn't idempotent, you can't retry safely. And if you can't retry safely, you can't run unattended - because every transient failure becomes a manual recovery.

Here's the pattern I use everywhere. Every unit of work gets a deterministic ID derived from its inputs, not a random UUID:

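import hashlib

def job_id(topic: str, site: str, date: str) -> str:
    # Normalize each field on its own, so stray whitespace or casing
    # in any one input still maps to the same ID.
    parts = (part.strip().lower() for part in (site, topic, date))
    raw = "::".join(parts)
    # The first 16 hex characters (64 bits) of SHA-256 keep the key short.
    return hashlib.sha256(raw.encode()).hexdigest()[:16]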

Before any expensive step (an LLM call, a publish, a paid API), I check a runs table in Postgres:

INSERT INTO runs (job_id, step, status, started_at)
VALUES ($1, $2, 'running', now())
ON CONFLICT (job_id, step) DO NOTHING
RETURNING id;

If RETURNING gives me nothing, the step has already started or finished. I skip it. If it gives me a row, I own the work.
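Here is a runnable sketch of the claim check. It uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere, and it reads `cursor.rowcount` instead of `RETURNING`. The `claim_step` helper and the trimmed-down table schema are assumptions for illustration:

```python
import sqlite3


def claim_step(conn: sqlite3.Connection, job_id: str, step: str) -> bool:
    """Try to claim (job_id, step). True means this caller owns the work."""
    cur = conn.execute(
        "INSERT INTO runs (job_id, step, status) VALUES (?, ?, 'running') "
        "ON CONFLICT (job_id, step) DO NOTHING",
        (job_id, step),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the unique
    # constraint turned the insert into a no-op.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (job_id TEXT, step TEXT, status TEXT, UNIQUE (job_id, step))"
)

first = claim_step(conn, "a1b2c3", "generate")   # first caller claims the step
second = claim_step(conn, "a1b2c3", "generate")  # a retry finds it already claimed
print(first, second)
```

The sketch does not cover a step that crashes after claiming its row: that row stays in 'running', so a real system would probably need some staleness rule on top.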

This single pattern - deterministic IDs plus a runs table with a unique constraint - is what makes the difference between a system I can leave alone and one I can't. It costs about an hour to add and saves dozens of hours of cleanup.

The expensive things to make idempotent in an LLM pipeline:

LLM generations - cache by hash(prompt + model + temperature). A retry should not cost another $0.40.
External API writes - WordPress publishes, CMS updates, search-console pings. Always check "does this slug already exist?" before creating.
Vector store inserts - upsert by document hash, never blind insert. Otherwise duplicates poison your RAG retrieval.
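The first item in that list, caching generations by a hash of prompt, model, and temperature, could look roughly like this. The dict-backed cache and the `generate` callable are placeholders; in practice the cache would live in a table or key-value store, and `generate` would wrap whatever LLM client you use:

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(prompt: str, model: str, temperature: float) -> str:
    """Deterministic key over everything that affects the generation."""
    # JSON keeps the fields unambiguous; naive string concatenation could let
    # ("ab", "c") and ("a", "bc") collide.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(
    cache: Dict[str, str],
    generate: Callable[[str, str, float], str],
    prompt: str,
    model: str,
    temperature: float,
) -> str:
    """Call `generate` only when this exact request has not been seen before."""
    key = cache_key(prompt, model, temperature)
    if key not in cache:
        cache[key] = generate(prompt, model, temperature)
    return cache[key]
```

A retry with the same prompt, model, and temperature then reads the stored text instead of paying for a second call.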

Observability: every run leaves a paper trail

If something goes wrong at 3 a.m. and I can't reconstruct what happened from logs alone, the system has failed me - even if it technically recovered.

I run two layers of observability:

Layer 1: structured logs. Every log line is JSON with at minimum job_id, step, duration_ms, status, and cost_usd if an LLM was called. CloudWatch is fine for this; so is anything that can query JSON.

Layer 2: a runs table in Postgres. This is the source of truth for "what is this system actually doing?" Each row is one step of one job, with start time, end time, status, error message, retry count, and a JSONB blob of inputs/outputs.

The reason I keep a database table on top of logs is simple: I can query it. A dashboard built on SELECT status, count(*) FROM runs WHERE started_at > now() - interval '24 hours' GROUP BY status tells me more in three seconds than scrolling through CloudWatch ever will.
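For the structured-log layer, a one-function sketch is enough to show the shape of a line. The field names come from the list above; the `ts` timestamp field and the `log_line` helper are additions for illustration:

```python
import json
import time
from typing import Optional


def log_line(
    job_id: str,
    step: str,
    status: str,
    duration_ms: int,
    cost_usd: Optional[float] = None,
) -> str:
    """Build one JSON log line with the minimum fields for every step."""
    record = {
        "ts": time.time(),  # assumption: epoch seconds; not in the field list above
        "job_id": job_id,
        "step": step,
        "status": status,
        "duration_ms": duration_ms,
    }
    if cost_usd is not None:  # only present when an LLM was called
        record["cost_usd"] = cost_usd
    return json.dumps(record, sort_keys=True)


print(log_line("a1b2c3", "generate", "ok", 1840, cost_usd=0.40))
```

Printing to stdout is usually enough on Lambda, since the runtime forwards stdout to CloudWatch Logs.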

A few things I always track:

Cost per run. LLM bills get out of hand fast when nobody's watching. I want to know that yesterday cost $4.20 and not $42.
Latency p50/p95. A model slowdown is often the first sign of an upstream issue.
Retry count distribution. If "average retries to success" creeps up, something is degrading even if everything still "works."
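If the run records carry a duration, p50 and p95 are a few lines of code. This sketch uses the nearest-rank definition, which is one of several common percentile definitions; the sample durations are made up:

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical step durations in milliseconds, with one slow outlier.
durations_ms = [120, 95, 110, 3000, 105, 98, 101, 99, 130, 115]
p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
print(p50, p95)
```

The outlier leaves p50 untouched and dominates p95, which is why both are worth tracking.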

According to Google's Site Reliability Engineering book, the four golden signals are latency, traffic, errors, and saturation. For an unattended LLM pipeline I'd add a fifth: cost. It's the one that bites hardest if you miss it.

Failure handling: assume everything will break

Every external dependency in my content pipeline has failed at least once in production. Anthropic API timeouts. WordPress REST endpoint returning 502. A vector DB connection pool exhausted. A scraped page that suddenly returns 800 KB of JavaScript instead of HTML.

The question is never "will this fail" but "what does the system do when it does."

My layered approach:

1. Retry with exponential backoff and jitter

For transient errors (5xx, timeouts, rate limits), I retry with backoff: 1s, 4s, 16s, 60s, with jitter to avoid thundering-herd retries. Three retries max for most steps. After that, the job goes to a dead-letter queue.
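A sketch of that retry loop follows. It makes several assumptions: the delays are read as a base of 1s multiplied by 4 per retry and capped at 60s; the jitter is "full jitter" (a uniform random fraction of the delay), which is one common choice and not necessarily the one used here; and `TransientError` is a hypothetical marker exception:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (5xx, timeouts, rate limits)."""


def backoff_delay(attempt: int, base: float = 1.0, factor: float = 4.0, cap: float = 60.0) -> float:
    """Un-jittered delay before retry number `attempt` (1-based): 1s, 4s, 16s, 60s."""
    return min(cap, base * factor ** (attempt - 1))


def retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn; on TransientError wait with jittered backoff, up to max_retries retries."""
    for attempt in range(1, max_retries + 2):  # one initial try plus max_retries retries
        try:
            return fn()
        except TransientError:
            if attempt > max_retries:
                raise  # retries exhausted; the caller routes the job to the DLQ
            # Full jitter: sleep a random fraction of the delay so that many
            # workers do not all retry at the same instant.
            sleep(backoff_delay(attempt) * rng())
    raise AssertionError("unreachable")
```

Passing `sleep` and `rng` in as parameters keeps the loop testable without real waiting.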

2. Circuit breakers for upstream services

If the LLM provider returns errors on 5 consecutive calls within 2 minutes, I trip a breaker and pause new generations for 10 minutes. This stops me from burning retries, and money, on calls that are doomed during a provider outage.
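One way to sketch such a breaker in memory, using the thresholds above as defaults. The class and method names are illustrative, and a multi-Lambda deployment would have to keep this state somewhere shared (a table, for example), not in process memory:

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive failures inside `window_s`; stay open for `cooldown_s`."""

    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 120.0,
        cooldown_s: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures: List[float] = []  # timestamps of the current failure streak
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a new call may go out right now."""
        if self._opened_at is None:
            return True
        if self.clock() - self._opened_at >= self.cooldown_s:
            self._opened_at = None  # cooldown elapsed: close and start fresh
            self._failures.clear()
            return True
        return False

    def record_success(self) -> None:
        self._failures.clear()  # any success breaks the consecutive streak

    def record_failure(self) -> None:
        now = self.clock()
        # Keep only failures that are still inside the window, then add this one.
        self._failures = [t for t in self._failures if now - t <= self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

Callers check `allow()` before each generation and report the outcome with `record_success()` or `record_failure()`.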

3. Dead-letter queues with context

A failed job doesn't disappear - it lands in a DLQ table with the full input payload, the last error, the retry history, and a manual-retry endpoint. Most mornings the DLQ is empty. When it isn't, I can replay a job with one button.

4. Quality gates as a kind of failure

This is the one most people miss. In a content pipeline, "the article was generated successfully" doesn't mean "the article is good." I run automated checks - factual claims have sources, length is in range, no prompt-leak phrases, readability score above a threshold. If the gate fails, the article goes to a review queue, not to publish. That's a failure too, just a softer one.
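A stripped-down version of such a gate might look like the sketch below. The thresholds, the leak-phrase list, and the `source_count` input are all placeholder assumptions, and the readability check is left out because it needs a scoring formula of its own:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical examples of phrases that suggest the prompt leaked into the output.
LEAK_PHRASES = ("as an ai language model", "system prompt")


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def quality_gate(
    text: str, source_count: int, min_words: int = 800, max_words: int = 2500
) -> GateResult:
    """Run cheap automated checks and collect every reason the article fails."""
    reasons: List[str] = []
    words = len(text.split())
    if not min_words <= words <= max_words:
        reasons.append(f"length {words} outside {min_words}-{max_words} words")
    lowered = text.lower()
    for phrase in LEAK_PHRASES:
        if phrase in lowered:
            reasons.append(f"prompt-leak phrase: {phrase!r}")
    if source_count == 0:
        reasons.append("no sources attached to factual claims")
    return GateResult(passed=not reasons, reasons=reasons)


def next_status(result: GateResult) -> str:
    """Map the gate outcome to an article status: publishable or human review."""
    return "ready" if result.passed else "needs_review"
```

Collecting all reasons, instead of stopping at the first, makes the morning review message more useful.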

5. Kill switches

Every system I run has a feature flag in a config table that says enabled: true/false. If something is going wrong and I don't yet understand what, I flip the flag. Within one cron cycle the system stops touching anything. This has saved me twice.
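In code, the kill switch is little more than a check before each unit of work. This sketch re-reads the flag before every job, which is stricter than once per cron cycle, and treats a missing flag as "off"; both choices are assumptions. The dict stands in for the config table:

```python
from typing import Callable, Iterable, Mapping


def run_cycle(config: Mapping[str, object], jobs: Iterable[Callable[[], None]]) -> int:
    """Run jobs in order while the flag is on; return how many actually ran."""
    done = 0
    for job in jobs:
        # Fail closed: if the flag is missing or false, stop touching anything.
        if not config.get("enabled", False):
            break
        job()
        done += 1
    return done


config = {"enabled": True}
ran = run_cycle(config, [lambda: None, lambda: None])
print(ran)
```

Flipping `enabled` to false mid-cycle stops the loop before the next job starts.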

A walk through one real job

Here's what actually happens when my content machine generates one article, end to end. This is the kind of detail that's hard to find in framework docs because it only matters once you've shipped.

1. EventBridge fires at 04:00 UTC. Lambda enumerates active sites and pushes one SQS message per site.
2. Each message triggers a research Lambda. It computes job_id = hash(site + date), inserts a runs row, and either does the research or exits if the row already existed.
3. Research output is upserted into Postgres. Each candidate topic gets its own deterministic ID. Duplicates from a previous run silently no-op.
4. A separate worker reads candidate topics that passed scoring thresholds and queues generation jobs. Each generation job ID is hash(site + topic_slug + date).
5. Generation calls Claude with the article prompt, streams the response, validates JSON shape, and writes to Postgres. Token usage and cost get logged.
6. A quality-gate Lambda runs checks. If it passes, the article moves to status = ready. If not, status = needs_review and I get a Slack ping in the morning.
7. A publisher polls for status = ready articles. It publishes via the CMS API using the slug as the idempotency key. If the slug already exists, it updates instead of duplicating.
8. A final Lambda pings the search console for indexing and writes the published URL back to the runs table.

Every one of those steps is independently retryable. If step 6 crashes after step 5, I haven't lost the generated article and I won't pay to generate it again. That's the whole point.

What I'd do if I were starting over today

If I were building my first unattended system from scratch in 2026, here's the order I'd follow. Resist the urge to skip steps even though they feel boring.

Postgres first. Set up the runs table and the deterministic-ID pattern before you write any business logic. One hour of work, saves you weeks.
One trigger, one queue, one worker. Start with the simplest topology that works. Don't reach for Step Functions or orchestration frameworks until you have at least three workers that actually need coordinating.
Log cost on every LLM call. Even if you're "just testing." The habit matters more than the data.
Build the kill switch on day one. Before you write the first production-bound feature.
Test failure modes before you trust the system. Turn off the network for a minute. Kill the database mid-job. Send a malformed payload. If the system recovers cleanly, you can sleep. If not, you can't.
Use boring tools. EventBridge, SQS, Lambda, Postgres. The exciting parts of your system should be the LLM logic and the data, not the plumbing.

The honest reason most "AI agent" demos fail in production isn't the model. It's that the surrounding system has none of these properties. The agent works fine until something flickers, and then it doesn't, and nobody knows why.

Wrapping up

A system you can leave alone is a system you've taught how to fail. Scheduling decides when work happens. Idempotency means you can retry. Observability means you'll know what happened. Failure handling means a flicker stays a flicker instead of turning into a 6 a.m. cleanup.

If you're building something that needs to run unattended - a content pipeline, a data sync, a multi-agent workflow, an internal automation - and you want a second pair of eyes on the architecture before you ship it, I'm happy to take a look. You can reach me at lazar-milicevic.com/#contact, or browse the rest of the blog for more on how I build these systems day to day.

Frequently asked questions

What four properties does an unattended automation system need to run reliably?

Every system I leave running without supervision needs four properties, in this order: scheduling, idempotency, observability, and graceful failure handling. Scheduling answers when work happens (cron, queue triggers, or debounced events). Idempotency ensures running the same operation twice produces the same result as once. Observability gives you a paper trail through structured logs and a runs table. Graceful failure means retries with backoff, dead-letter queues, and circuit breakers so transient errors don't become manual recovery jobs. Skip any one and you'll wake up to a mess.

When should I use cron versus event-driven triggers for automation pipelines?

Use cron only for steady, predictable workloads like a daily research job at 04:00 UTC. Use event-driven triggers when work arrives unpredictably, like processing a new item the moment it lands in a queue. My concrete rule: if your scheduled job finds zero work to do more than half the time, it should be event-driven instead. Event-driven triggers are almost always cheaper and faster than polling, because polling floods your logs and adds latency before work even starts. Debounced events are a third option when you want to wait briefly (e.g. 90 seconds) before acting.

How do I make an LLM pipeline idempotent so I can retry safely?

I give every unit of work a deterministic ID derived from its inputs (e.g. a SHA-256 hash of site, topic, and date) rather than a random UUID. Before any expensive step, I insert into a Postgres runs table with a unique constraint on (job_id, step) using ON CONFLICT DO NOTHING - if no row is returned, the step has already started or finished and I skip it. I also cache LLM generations by hash(prompt + model + temperature) so retries don't cost another API call, check for existing slugs before publishing to a CMS, and upsert vector store entries by document hash. This pattern takes about an hour to add and saves dozens of hours of cleanup.

Why should I keep a runs table in Postgres if I already have logs?

Logs are great for forensic detail, but a runs table is queryable, which is what you need at 3 a.m. or for daily health checks. I store one row per step of each job with start time, end time, status, error message, retry count, and a JSONB blob of inputs and outputs. A single SQL query like SELECT status, count(*) FROM runs WHERE started_at > now() - interval '24 hours' GROUP BY status tells me more in three seconds than scrolling CloudWatch ever will. It becomes the source of truth for what the system is actually doing, while structured JSON logs handle the per-line detail.

What does it mean that 'autonomous doesn't mean unsupervised' for AI automation systems?

Autonomous means the system knows how to fail safely without a human in the loop - not that no human ever watches it. A well-built unattended system catches its own failures, rolls back cleanly, and alerts you with enough context to act. The alert itself is a success signal: it means the system didn't corrupt data, double-publish, or ship half-finished work. I build the boring scaffolding (scheduling, idempotency, observability, failure handling) before the exciting LLM logic, because that scaffolding is what makes it safe to actually walk away.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
