Close the Loop: When LLMs Work and When You Need More
There's a simple question that tells you whether you can trust an LLM for a task:
Can you close the loop?
If you can verify the output — through tests, experiments, iteration, or just looking at it — you're fine. Use the model. Ship fast. Iterate.
If you can't, you're in dangerous territory. That's when you need to think harder about what the model actually knows versus what it's pattern-matching.
This isn't a new insight. It connects to Pearl's causal hierarchy, the Lucas Critique in economics, and recent work on performative prediction in ML. The contribution here is organizing it around a simple question practitioners can ask.
The Loop Hierarchy
Not all feedback loops are created equal. Here's a rough taxonomy:
Tier 1: Tight loop, fast feedback
Use LLMs freely.
- Code: The compiler doesn't care about your feelings. Tests pass or fail. You know within seconds whether the output works.
- Writing: You read it. You revise. The loop is you, and it's immediate.
- Design: You see the mockup. You tweak. Iterate until it looks right.
- Data transformation: Output either matches spec or doesn't.
The pattern: errors are cheap and caught quickly. The LLM can be wrong 30% of the time and still be massively useful, because you're catching and fixing those errors in real time.
This is where most LLM productivity gains live today. And it's why "vibe coding" works — not because the model is always right, but because you can tell when it's wrong.
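A tier-1 loop in miniature: treat model output as a draft, and let tests decide whether it ships. The `slugify` function below is a hypothetical stand-in for LLM-generated code; the point is the verification, not the function.

```python
# Stand-in for LLM-generated code (hypothetical example).
def slugify(title: str) -> str:
    """Turn a title into a URL slug."""
    return "-".join(title.lower().split())

# The loop closes in milliseconds: pass or fail, no judgment calls.
assert slugify("Close the Loop") == "close-the-loop"
assert slugify("  extra   spaces  ") == "extra-spaces"
print("all checks passed")
```

If the model gets the function wrong, an assertion fails and you know immediately. That cheap, unambiguous signal is what makes tier 1 safe.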
Tier 2: Loop exists, but slower and noisier
Use LLMs with monitoring.
- Recommendations: Users click or don't, but the signal is delayed and confounded. Did they click because your recommendation was good, or because it was at the top of the page?
- Search ranking: Engagement metrics exist, but they measure what users did, not what they wanted.
- Customer support: Tickets get resolved, but was it the AI that resolved them, or the human it escalated to?
- Content moderation: You catch some errors through reports, but false negatives are invisible.
The pattern: feedback exists but it's noisy, delayed, or partially observable. You can learn and improve, but you need to be careful about what you're learning. Easy to optimize for the metric while missing the point.
This is Goodhart's Law territory: "When a measure becomes a target, it ceases to be a good measure." Your feedback loop exists, but it might be teaching you the wrong lessons.
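The recommendation example above can be made concrete with a toy simulation. The behavior model and all the numbers are invented for illustration: users click good items more, but top-of-page placement inflates clicks regardless of quality.

```python
import random

random.seed(0)

def click(quality: float, position: int) -> bool:
    # Assumed behavior model (made up): clicks rise with quality,
    # but position 0 gets a large bonus independent of quality.
    p = 0.1 * quality + (0.3 if position == 0 else 0.0)
    return random.random() < p

# A mediocre item shown at the top vs a great item shown lower down.
trials = 10_000
top_mediocre = sum(click(quality=2, position=0) for _ in range(trials)) / trials
low_great = sum(click(quality=4, position=3) for _ in range(trials)) / trials

# The raw metric ranks the mediocre item higher: Goodhart in action.
print(f"top-of-page mediocre item CTR: {top_mediocre:.2f}")
print(f"lower-ranked great item CTR:  {low_great:.2f}")
```

Optimizing click-through rate here would teach the system to reward placement, not quality, which is exactly the "wrong lessons" failure mode.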
Tier 3: Loop exists, but confounded
You need causal reasoning to interpret your feedback.
- Marketing attribution: Conversions happen. But what caused them? The ad? The brand? The fact that people who click ads were going to buy anyway?
- Pricing decisions: Sales change when you change prices. But was it the price, or seasonality, or a competitor's move, or economic conditions?
- Hiring: Some employees succeed, others don't. But your feedback is polluted by selection bias — you only see outcomes for people you hired.
- A/B tests with interference: Your test says variant B wins. But users in group A saw their friends using B, and that affected their behavior.
The pattern: you have feedback, but it will actively mislead you if you take it at face value. The loop is lying to you because correlation isn't causation, and the world is full of confounders.
This is where you need the econometrics toolbox — or at least awareness that your intuitive read of the data might be backwards. Perdomo et al.'s work on performative prediction formalizes exactly this problem: when your predictions influence the outcomes you're measuring, standard ML assumptions break down.
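Here is a minimal sketch of a loop that lies. A confounder (purchase intent) drives both ad clicks and conversions; by construction the ad has zero causal effect, and all the probabilities are made-up illustrative numbers.

```python
import random

random.seed(1)

population = []
for _ in range(100_000):
    intent = random.random() < 0.2           # 20% were going to buy anyway
    clicked = random.random() < (0.6 if intent else 0.1)
    converted = intent                        # the ad has NO causal effect
    population.append((clicked, converted))

def conv_rate(rows):
    return sum(c for _, c in rows) / len(rows)

clicked_rows = [r for r in population if r[0]]
not_clicked_rows = [r for r in population if not r[0]]

# Taken at face value, clickers convert far more often, so "the ad works".
# In truth, the ad did nothing; intent did everything.
print(f"conversion | clicked:     {conv_rate(clicked_rows):.2f}")
print(f"conversion | not clicked: {conv_rate(not_clicked_rows):.2f}")
```

The naive comparison shows roughly a 6x lift for an ad with a true effect of zero. That is what "the loop is lying to you" looks like in data.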
Tier 4: No loop at all
You need causal reasoning to make the decision.
- Policy: You implement a regulation. You can't observe the counterfactual world where you didn't.
- Strategic bets: You enter market A instead of market B. You'll never know what would have happened in B.
- Medical treatment: The patient gets the drug or the placebo. Not both.
- One-shot decisions: Launch timing, acquisition targets, pivots. No do-overs.
The pattern: there is no feedback loop. The counterfactual is unobservable. You have to reason about what would happen under different choices, because you can't measure it.
This is the hardest case, and it's where most high-stakes decisions live. It's also the top of Judea Pearl's "ladder of causation" — the level of counterfactual reasoning that requires imagining worlds that never existed.
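The fundamental problem at tier 4 can be written in a few lines of potential-outcomes notation: every unit has an outcome under treatment and an outcome under no treatment, but the realized data reveals only one of the two. The values below are illustrative placeholders.

```python
patients = [
    # (outcome_if_treated, outcome_if_untreated), both known only in simulation
    (1, 0),
    (1, 1),
    (0, 0),
]

treated = [True, False, True]  # what actually happened

# What you can observe: one potential outcome per patient, never both.
observed = [y1 if t else y0 for (y1, y0), t in zip(patients, treated)]
print(observed)  # → [1, 1, 0]

# The true average effect is computable here only because we simulated
# both columns; in reality, the missing column is gone forever.
true_ate = sum(y1 - y0 for y1, y0 in patients) / len(patients)
print(f"true effect (only knowable in simulation): {true_ate:.2f}")
```

No amount of extra data on the observed column recovers the missing one. That is why tier-4 decisions require reasoning about counterfactuals rather than measuring them.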
Where Causality Fits In
Here's the insight: causality is what you need when you can't close the loop cleanly.
If you can run a proper randomized experiment, you're closing the loop on causality directly. The experiment is the causal reasoning — you don't need to think about confounders because randomization handles them.
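To see why randomization handles confounders, rerun the confounded-ad world from tier 3, but assign treatment by coin flip. This is an illustrative simulation with made-up numbers, not a real system.

```python
import random

random.seed(2)

t_conv = t_n = c_conv = c_n = 0
for _ in range(50_000):
    intent = random.random() < 0.2
    treated = random.random() < 0.5      # coin flip, independent of intent
    converted = intent                    # treatment truly does nothing
    if treated:
        t_conv += converted; t_n += 1
    else:
        c_conv += converted; c_n += 1

# The naive comparison is now unbiased: both arms hover near 0.20,
# so the estimated effect is ~0, which is the truth.
print(f"treated conversion:  {t_conv / t_n:.2f}")
print(f"control conversion:  {c_conv / c_n:.2f}")
```

Same world, same confounder, but because assignment ignores intent, the confounder balances out across arms and the simple difference in means recovers the true (null) effect.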
The entire field of causal inference exists for situations where you can't run the experiment:
- Diff-in-diff: "I can't randomize, but I have before/after data and a comparison group."
- Instrumental variables: "I can't randomize, but I have something that affects treatment but not outcome."
- Regression discontinuity: "I can't randomize, but there's an arbitrary cutoff I can exploit."
These are all ways of extracting causal signal when the loop is confounded or missing.
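The arithmetic behind the first of these, diff-in-diff, fits in a few lines. The four group means below are made-up illustrative numbers, and the sketch leans on the method's key assumption of parallel trends.

```python
# Before/after means for a treated group and a comparison group.
treated_before, treated_after = 10.0, 16.0
control_before, control_after = 10.0, 12.0

# Parallel trends assumption: absent treatment, the treated group would
# have drifted like the control group (+2). The excess is the effect.
did = (treated_after - treated_before) - (control_after - control_before)
print(f"diff-in-diff estimate: {did:+.1f}")  # prints +4.0
```

The raw before/after change for the treated group is +6, but diff-in-diff attributes +2 of that to the shared trend, leaving an estimated effect of +4.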
Robert Lucas made this point in economics fifty years ago: models trained on historical correlations fail when policy changes the underlying system. The correlations you learned were equilibrium outcomes, not laws of nature. Change the incentives, and the equilibrium shifts.
The Practical Upshot
When someone asks "can I use an LLM for this?", the real question is:
How tight is my feedback loop, and how clean is the signal?
| Loop quality | What to do |
|---|---|
| Tight and fast | Ship. Iterate. Let the LLM cook. |
| Slow but clean | Use with monitoring. Build evaluation. |
| Confounded | You need causal reasoning to learn correctly. |
| Missing | You need causal reasoning to decide at all. |
Most LLM wins today are in the top row. That's fine — there's a lot of value there.
The danger is when people drift down the table without noticing. When they take LLM outputs that feel authoritative and use them for tier 3 or tier 4 decisions. When the loop is broken but the confidence is high.
The LLM's Role Changes By Tier
This isn't about "LLMs good" or "LLMs bad." It's about what role they should play:
Tier 1: LLM as doer. It writes the code, drafts the text, generates the options. You verify and ship.
Tier 2: LLM as assistant with monitoring. It handles the task, but you're watching metrics and catching drift.
Tier 3: LLM as analyst, not oracle. It can help you think about confounders, generate hypotheses, explain results. But the causal conclusion comes from proper methods, not from asking the model "what caused X?"
Tier 4: LLM as thought partner. It helps you structure the decision, surface considerations, stress-test assumptions. But the reasoning is yours, informed by whatever causal evidence you can gather.
The mistake is using a tier-1 workflow for a tier-4 problem. That's when people get burned.
The Bottom Line
The question isn't "is this LLM smart enough?"
The question is: "Can I close the loop?"
If yes: ship fast, iterate, trust the process.
If no: slow down. Figure out what kind of feedback you have, how confounded it is, and what you need to reason correctly about cause and effect.
The models keep getting better. But the loop problem doesn't go away. No matter how good GPT-N gets, you still can't observe counterfactuals, and you still have to think about what would have happened when you're making decisions that change the world.
That's not a limitation of AI. It's a feature of reality.
This post is a practical coda to my series on transformers and causal inference. The short version: LLMs can learn some causal structure, but it's fragile, and for high-stakes decisions you need more than pattern matching. This post is about knowing when you're in that territory.
References:
- Perdomo et al., "Performative Prediction" (ICML 2020) — formalizes when predictions influence outcomes
- Pearl & Mackenzie, The Book of Why (2018) — the ladder of causation: association → intervention → counterfactuals
- Lucas, "Econometric Policy Evaluation: A Critique" (1976) — why correlations break when policy changes
- Liu et al., "Delayed Impact of Fair Machine Learning" (ICML 2018) — feedback loops in ML affecting populations
Tags: AI, LLMs, Causality, Decision Making, Feedback Loops, Machine Learning