
PoC to Production: Scale AI Without a Rewrite
PoCs usually fail not because of model quality but because day-one architectural shortcuts become load-bearing by month six. I lock in four things early —
First-person essays on AI automation, GenAI/LLM engineering, cloud architecture and building autonomous systems, by Lazar Milicevic.

I rebuilt my RAG evaluation after a pipeline scored 0.91 on answer relevancy while hallucinating account numbers in production, and now run a 12-metric

I shipped an LLM judge that scored outputs 4.6/5 when humans rated them 3, so I built a calibration recipe using a 150-sample human-labeled gold set

Most engineers ask how my LLM eval approach differs from Hamel Husain's: we agree on fundamentals like error analysis and sparing LLM-as-judge use, but

I build unattended systems on four properties in order—scheduling, idempotency, observability, and graceful failure—starting with boring scaffolding

I build unattended systems that replace recurring manual work—mapping the real process, automating deterministic parts with code and judgment calls with