LangChain in Production: What I Keep and What I Rip Out

By Lazar Milicevic · July 4, 2026 · 11 min read

I have a rule for LLM application development after shipping autonomous systems for several years: frameworks earn their place only when the cost of removing them exceeds the cost of keeping them. LangChain is the framework I have wrestled with the most. It gets you to a working prototype in an afternoon, then fights you for the next six months when you try to make it run unattended.

Here is what I actually keep from LangChain in production LLM apps, what I replace immediately, and the architecture decisions that matter more than the framework choice itself.

Where LangChain Earns Its Weight

LangChain shines in the first week of a project. When you need to wire an LLM to a tool, attach a vector store, and prove the core idea works, the abstraction layer saves real time. The document loaders, the text splitters, the prompt templates, the growing catalog of integrations. These are genuinely useful. I still use several of them in BizFlowAI ContentStudio, the autonomous content and SEO system I built.

Specifically, I keep four things:

Document loaders and splitters. The recursive character text splitter is solid. I have benchmarked it against custom splitters and the difference in retrieval quality is negligible for most use cases. Use it, configure the chunk size and overlap for your data, and move on.

Prompt templates. ChatPromptTemplate and its variants are clean. They handle message formatting across providers, which matters when you are testing the same prompt against Claude and OpenAI side by side. The variable interpolation is simple and predictable.

Output parsers. PydanticOutputParser and the structured output parsers are worth keeping if you are not yet using provider-native structured outputs. Wrapped in OutputFixingParser or RetryOutputParser, they give you a retry layer when the model returns malformed JSON.

The integration catalog. If you need to connect to a vector store you have never used, LangChain probably has a wrapper for it. The wrapper might be thin, but it saves you reading the provider docs for a proof of concept.

That is where my endorsement ends. Everything else gets replaced.

Where LangChain Becomes a Liability

The moment your LLM application needs to run without human supervision, the abstractions start working against you. Here are the specific problems I have hit in production and how I solved them.

Chains That Hide Control Flow

LCEL (LangChain Expression Language) looks elegant in a notebook. You pipe components together, it reads like a data pipeline, and it works. Then a step fails in production and you are staring at a stack trace that passes through fifteen internal wrapper classes before it tells you what actually broke.

I had a retrieval chain in ContentStudio that would occasionally fail on the reranking step. The LCEL chain caught the exception, retried silently, and swallowed the error context. The system kept running but returned worse results. I did not notice for two weeks because the failure was silent.

What I do instead. I write the orchestration as plain Python functions. Each step is an explicit function call with a try/except block, structured logging, and a clear return type. I lose the one-liner elegance. I gain the ability to look at a traceback and know exactly which step failed, with what input, and why.

# Instead of this:
chain = retriever | prompt | model | parser
result = chain.invoke(query)

# I do this (log is a structured logger that accepts keyword fields):
def run_retrieval_pipeline(query: str, config: PipelineConfig) -> RetrievalResult:
    documents = retrieve(query, config)
    if not documents:
        log.warning("No documents retrieved", query=query)
        return RetrievalResult.empty()

    try:
        reranked = rerank(documents, query, config)
    except Exception:
        # Fail loudly with the input attached instead of degrading silently.
        log.exception("Rerank failed", query=query, doc_count=len(documents))
        raise

    context = build_context(reranked)
    result = generate(context, query, config)

    log.info("Pipeline completed",
             doc_count=len(documents),
             tokens_used=result.usage.total_tokens)
    return result

The second version is longer. It is also debuggable at 2 AM when an automated job fails and you have nothing but a log line.

Agent Executors That Swallow Errors

The agent executor abstraction in LangChain was designed for interactive use. You give it a goal, it loops through tool calls, and eventually it produces an answer. In production, this loop is dangerous.

The default agent executor has retry logic, intermediate step storage, and error handling that is difficult to customize without subclassing and overriding internals. When a tool call fails intermittently, the executor retries with a slightly different prompt, the model takes a different path, and you get nondeterministic behavior that is nearly impossible to reproduce.

What I do instead. I build agent loops manually. This is not as much work as it sounds. A tool-calling agent loop is roughly 60 lines of Python. You call the model, check if it wants to use a tool, execute the tool, append the result, and repeat until the model produces a final answer or you hit a step limit.

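def agent_loop(
    query: str,
    tools: list[Tool],
    model,
    max_steps: int = 8
) -> str:
    messages = [system_prompt(), user_message(query)]

    for step in range(max_steps):
        response = model.chat(messages, tools=tools)

        if response.stop_reason == "end_turn":
            return response.content

        # Record the assistant turn that requested the tools. Tool results
        # must follow the tool-call message they answer, or the next call fails.
        # assistant_message() is a helper in the same style as user_message().
        messages.append(assistant_message(response))

        for tool_call in response.tool_calls:
            result = execute_tool(tool_call, tools)
            messages.append(tool_result_message(tool_call.id, result))

    # Step limit reached without a final answer.
    return fallback_response(query)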

That is the entire agent. I control the retry logic. I control the logging. I control what happens at the step limit. If something breaks, the stack trace points to the exact line.

Callbacks and Streaming That Fight Each Other

LangChain's callback system is powerful but complex. Callbacks fire at multiple levels (chain, model, tool), the ordering is not always intuitive, and streaming a response to a user while also logging token usage requires careful callback coordination.

In one project, I needed to stream Claude's output to a UI while simultaneously counting tokens for cost tracking. The LangChain streaming callback and the token usage callback interfered with each other. The token counts were wrong on streamed responses because the callback fired before the stream completed.

What I do instead. For streaming, I call the provider SDK directly. Both the Anthropic and OpenAI Python SDKs support streaming with clean, predictable event structures. I handle the stream, count tokens from the final usage event, and log it. No callbacks, no indirection.
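A minimal sketch of that pattern, written against a provider-neutral event shape rather than a real SDK type. StreamEvent, StreamResult, and consume_stream are hypothetical names; the assumption is that you map your SDK's text-delta and usage events onto them. Text goes to the UI as it arrives, and token counts are read only from the usage event that closes the stream.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamEvent:
    # Hypothetical, provider-neutral event shape; not a real SDK type.
    kind: str                      # "text" or "usage"
    text: str = ""
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class StreamResult:
    text: str
    input_tokens: int
    output_tokens: int


def consume_stream(
    events: Iterable[StreamEvent],
    on_text: Callable[[str], None],
) -> StreamResult:
    """Forward text chunks to the UI and read usage from the usage event."""
    chunks: list[str] = []
    usage: Optional[StreamEvent] = None

    for event in events:
        if event.kind == "text":
            chunks.append(event.text)
            on_text(event.text)    # push to the UI as soon as it arrives
        elif event.kind == "usage":
            usage = event          # keep the last one seen

    if usage is None:
        # Refuse to guess: no usage event means the stream was cut short.
        raise RuntimeError("stream ended without a usage event")

    return StreamResult("".join(chunks), usage.input_tokens, usage.output_tokens)
```

Raising on a missing usage event is a deliberate choice in this sketch: a wrong cost number is harder to spot than a loud failure.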

The Architecture Choices That Actually Matter

Whether you use LangChain or not is a minor decision compared to the architecture choices below. These are the patterns I rely on to keep LLM applications reliable in production.

Separate Your Prompts From Your Code

I store prompts as versioned text files, not as Python string literals inside LangChain templates. This lets me iterate on prompts without deploying code, run A/B tests between prompt versions, and diff prompt changes in code review.

The structure is simple:

prompts/
  content_writer/
    v1.txt
    v2.txt
    current.txt -> v2.txt

The application loads current.txt at startup. To test a new prompt, I update the symlink and restart. To roll back, I point it at the previous version. This has saved me from bad prompt deploys more times than I can count.
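Loading through the symlink can be sketched in a few lines. This assumes the directory layout shown above; load_prompt is a hypothetical helper name, and returning the resolved version alongside the text is my addition so the version can be logged with every call.

```python
from pathlib import Path


def load_prompt(prompts_dir: Path, name: str) -> tuple[str, str]:
    """Return (version, prompt_text) for prompts/<name>/current.txt."""
    current = prompts_dir / name / "current.txt"
    # resolve() follows the symlink, so the stem is the real version, e.g. "v2".
    # strict=True raises FileNotFoundError if the link or its target is missing.
    target = current.resolve(strict=True)
    return target.stem, target.read_text(encoding="utf-8")
```

Called as load_prompt(Path("prompts"), "content_writer") at startup, it fails fast on a dangling symlink instead of running with no prompt.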

Build a Retrieval Pipeline, Not a Vector Search

Vector similarity search alone produces mediocre retrieval. Every production RAG system I have shipped uses a hybrid retrieval pipeline with reciprocal rank fusion (RRF).

The pipeline has three stages:

Dense retrieval using pgvector or an external vector store. This catches semantic matches.
Sparse retrieval using BM25 or full-text search. This catches exact keyword matches that dense retrieval misses.
Reciprocal rank fusion to merge and re-rank both result sets.

This pattern consistently outperforms pure vector search in my testing. The improvement is most visible on queries that contain specific entity names, technical terms, or exact phrases.

def hybrid_search(query: str, conn, top_k: int = 10) -> list[Document]:
    dense = dense_search(query, conn, top_k * 2)
    sparse = bm25_search(query, conn, top_k * 2)
    fused = reciprocal_rank_fusion(dense, sparse)
    return fused[:top_k]

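I use pgvector for this because it lives inside PostgreSQL alongside the full-text index that serves the sparse side. One database, two retrieval methods, no separate vector store to maintain. One caveat: PostgreSQL's built-in full-text ranking (ts_rank) is not BM25, so true BM25 scoring requires an extension.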
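The reciprocal_rank_fusion call in hybrid_search is the only non-obvious piece, so here is a minimal sketch of it. It is not the ContentStudio implementation: it assumes each retriever returns document IDs ordered best first (rather than Document objects), and it uses k = 60, the commonly cited default for RRF.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def reciprocal_rank_fusion(
    *rankings: Sequence[Hashable],
    k: int = 60,
) -> list[Hashable]:
    """Merge ranked lists of document IDs; each list is ordered best first."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns 1 / (k + rank) from every list it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF looks only at ranks, it needs no score normalization between cosine similarity and keyword scores, which is what makes it a convenient way to merge the two result sets.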

Make Every LLM Call Observable

In production, you need to know what went into every LLM call and what came out. I log four things for every call:

Field	Why it matters
Full prompt (including context)	Reproducibility. You cannot debug a failure without the exact input.
Model and parameters	Temperature, max tokens, model version. A model update can change behavior silently.
Full response	The raw output before any parsing.
Token usage and latency	Cost tracking and performance monitoring.

This is not optional. If your logging is sparse, you are flying blind. I use structured JSON logging that goes into a search-capable store, so I can query past calls when investigating an issue.
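As a rough sketch, the four fields fit into one JSON log line per call. Everything named here is an assumption for illustration: logged_call is a hypothetical wrapper, and call_model stands in for a thin adapter around your provider SDK that returns a dict with "text" and "usage" keys.

```python
import json
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("llm_calls")


def logged_call(
    call_model: Callable[[str, dict[str, Any]], dict[str, Any]],
    prompt: str,
    params: dict[str, Any],
) -> dict[str, Any]:
    """Run one LLM call and emit a single JSON log line describing it."""
    started = time.perf_counter()
    response = call_model(prompt, params)
    latency_ms = round((time.perf_counter() - started) * 1000, 1)

    record = {
        "prompt": prompt,                    # full prompt, including context
        "params": params,                    # model, temperature, max tokens
        "response": response.get("text"),    # raw output before any parsing
        "usage": response.get("usage"),      # token counts as reported
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record, ensure_ascii=False))
    return response
```

This version logs only successful calls. A production wrapper would also want a log line in an except branch, so that failed calls leave the same trail.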

Handle Failure With Graceful Degradation

Autonomous systems need to handle LLM failures without human intervention. There are three failure modes I plan for:

Rate limits and timeouts. Every LLM call has exponential backoff with a maximum retry count. If retries are exhausted, the system logs the failure and uses a fallback. For content generation, the fallback might be skipping that item and moving to the next one in the queue.

Malformed outputs. Provider-native structured outputs (OpenAI's response_format, Anthropic's tool calling for structured data) are more reliable than asking the model to return JSON in free text. I use these wherever possible. When I cannot, I validate with Pydantic and retry once with a corrective prompt.

Empty or irrelevant retrieval. If the retrieval step returns documents with low similarity scores, I do not pass garbage context to the model. I either broaden the search, fall back to a different retrieval method, or return a default response that acknowledges the gap.
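The first of these failure modes can be sketched as a small wrapper. TransientLLMError and call_with_backoff are hypothetical names; in practice you would catch your SDK's own rate-limit and timeout exceptions. The retry count, base delay, and 25% jitter are illustrative values, not tuned ones.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("llm_retry")


class TransientLLMError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts)."""


def call_with_backoff(
    call: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on transient errors, then hand over to `fallback`."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientLLMError as exc:
            if attempt == max_retries:
                log.error("retries exhausted, using fallback: %s", exc)
                break
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 25% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    # Degrade instead of crashing the whole job.
    return fallback()
```

For a content queue, the fallback can be as simple as a function that returns a "skipped" marker so the worker moves on to the next item. Any other exception type propagates untouched, which keeps real bugs visible.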

Cost and Token Management in LangChain Apps

LangChain makes it easy to lose track of tokens because the abstractions hide the API calls. A single chain invocation might make three or four LLM calls under the hood (classification, retrieval, generation, validation) and you only see the final result.

In ContentStudio, I track token usage per pipeline run, not per API call. This means wrapping the entire pipeline in a token counter that aggregates across all internal calls.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def consume(self, tokens: int) -> bool:
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True

Every LLM call checks the budget before executing. If the pipeline is approaching its limit, it prioritizes the most important calls and skips optional ones. This prevents a single runaway job from costing $50 in API fees because it got stuck in a retry loop.
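One way the prioritization could look, as a sketch rather than the ContentStudio code. Step and run_steps are hypothetical names, and estimated_tokens is a rough pre-call estimate I am assuming you can supply; TokenBudget is repeated from above only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


class TokenBudget:
    # Same class as above, repeated so the sketch is self-contained.
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def consume(self, tokens: int) -> bool:
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


@dataclass
class Step:
    name: str
    estimated_tokens: int          # rough pre-call estimate, not a measured count
    run: Callable[[], str]
    required: bool = True


def run_steps(steps: list[Step], budget: TokenBudget) -> dict[str, str]:
    """Run required steps first, then optional ones while budget remains."""
    results: dict[str, str] = {}
    # sorted() is stable, so steps keep their relative order within each group.
    for step in sorted(steps, key=lambda s: not s.required):
        if not budget.consume(step.estimated_tokens):
            if step.required:
                raise RuntimeError(f"token budget exhausted before {step.name!r}")
            continue               # optional step: skip it rather than overspend
        results[step.name] = step.run()
    return results
```

With a limit of 1000, an 800-token required generation step runs and a 300-token optional validation step is skipped. A retry loop that keeps calling consume eventually gets False back and has to stop.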

When I Would Still Choose LangChain

I am not anti-LangChain. I use parts of it in production today. If you are building an internal tool where reliability requirements are moderate, where a human reviews the output before it ships, and where you need to integrate with many data sources quickly, LangChain is a reasonable choice.

The newer LangGraph library is also a step in the right direction. It makes the agent loop explicit, which addresses my biggest complaint about the agent executor. I am watching it closely and have used it on one project where the graph-based control flow was a genuine fit.

For systems that run unattended, where the output goes directly to users or production systems without review, I strip down to the provider SDKs, plain Python, and the few LangChain utilities that are genuinely useful. The result is a codebase that is easier to debug, easier to reason about, and more predictable under failure.

What I'd Do Differently

If I were starting a new LLM application today, here is the path I would take:

Start with the provider SDK directly. Whether you pick Claude or OpenAI, the official Python SDKs are well documented and straightforward. Build the core feature without any framework.
Use LangChain utilities selectively. Pull in the document loaders, splitters, or specific integrations only when you need them. Do not adopt the full framework.
Build your own agent loop. It is 60 lines of code and you will understand every line.
Use pgvector for retrieval. Keeping vectors in PostgreSQL alongside your other data simplifies your infrastructure enormously.
Log everything from day one. Not as an afterthought. The first time you need to debug a production issue, you will be grateful for the full prompt and response history.

The framework you choose matters less than the architecture around it. Observable calls, graceful failure handling, hybrid retrieval, and versioned prompts are what make an LLM application production-ready. None of those require LangChain, and several of them are harder with it.

I write about these patterns from real systems I have shipped, including the autonomous content and SEO pipeline at BizFlowAI. If you are building an LLM application that needs to run reliably in production, or you have a prototype that needs to survive contact with real users, reach out at lazar-milicevic.com/#contact. You can also find more of my production engineering writeups on the blog.

Frequently asked questions

Should I use LangChain in production LLM applications?

LangChain is excellent for prototyping but becomes a liability in unsupervised production environments. I keep four specific components (document loaders and splitters, prompt templates, output parsers, and integration wrappers) and replace everything else with plain Python. The framework earns its place for document processing and provider-agnostic formatting, but its orchestration abstractions create debugging nightmares that aren't worth the convenience once you need reliability.

What's wrong with LangChain Expression Language (LCEL) for production use?

LCEL pipes components together elegantly in a notebook but creates serious debugging problems when something fails in production. I had a retrieval chain where the reranking step failed silently: the LCEL chain caught the exception, retried, and swallowed the error context, so the system returned worse results for two weeks without me noticing. I now write orchestration as plain Python functions with explicit try/except blocks, structured logging, and clear return types so I can trace exactly which step failed and why.

How do you build an AI agent loop without LangChain's agent executor?

A custom tool-calling agent loop is roughly 60 lines of Python: you call the model, check if it wants to use a tool, execute the tool, append the result, and repeat until the model produces a final answer or you hit a step limit. LangChain's agent executor has opaque retry logic and error handling that causes nondeterministic, irreproducible behavior when tools fail intermittently. Building the loop manually gives you full control over retries, logging, and step limits, and stack traces point to the exact line that broke.

Which LangChain components are actually worth keeping?

I keep four LangChain components in production: the recursive character text splitter (which I benchmarked against custom splitters with negligible retrieval quality difference), ChatPromptTemplate for clean cross-provider message formatting, PydanticOutputParser for structured output retry logic if you're not using provider-native structured outputs, and the integration catalog for quickly connecting to unfamiliar vector stores or APIs. These are well-built, save real development time, and don't obscure control flow.

Why does LangChain fail for autonomous unattended LLM systems?

LangChain's abstractions are designed for interactive use where a human can notice and intervene when something goes wrong. In autonomous systems, the chains hide control flow behind layers of wrapper classes, the agent executor swallows errors and retries unpredictably, and the callback system creates ordering complexity that makes streaming unreliable. When a system runs without supervision, you need explicit error handling, deterministic behavior, and stack traces that point to exact failure points, none of which LangChain's abstractions provide out of the box.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
