AI and other ramblings

Can Transformers Actually Do Causal Inference? Part 4: How to Use Them Anyway

LLMs don't do real causal inference. They pattern-match. They confuse correlation with causation. They fail when you perturb the examples they've memorized.

But that doesn't mean they're useless for causal work. You just have to know where to deploy them.

The punchline: treat LLMs as powerful interfaces to causal workflows, not as reliable engines of causal inference. They're good at hypothesis generation, coding, explanation, and connecting text to proper causal tools. They're brittle and untrustworthy when asked to be the causal estimator.


Don't let the LLM be your causal engine

This is the core mistake. You ask GPT-4: "Does X cause Y?" It gives you a confident answer. You act on it.

Bad idea.

Empirical evaluations show that LLMs solve causal tasks through pattern recall and surface heuristics. Performance drops sharply when benchmarks are refreshed or require genuine intervention reasoning. The model isn't reasoning about causation — it's reciting what sounded causal in its training data.

For any decision that needs identification-quality causality — pricing changes, policy rollouts, safety-critical systems — never rely on "the LLM says A causes B."

Instead: use explicit causal estimators (difference-in-differences, instrumental variables, regression discontinuity, structural causal models, causal forests, doubly robust ML) as the ground truth engine. Let the LLM orchestrate and explain them.
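
To make that division of labor concrete, here's a minimal sketch (Python, synthetic data, made-up column names): the LLM's only job is to draft the regression spec, and the number comes from a standard difference-in-differences regression, not from the model's prose.

    # Minimal sketch of "LLM orchestrates, estimator computes".
    # The LLM drafts the spec (which columns are treatment / post / outcome);
    # the estimate is owned by an explicit DiD regression.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 2000
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),
        "post": rng.integers(0, 2, n),
    })
    # Synthetic outcome with a true DiD effect of 1.5 on the treated-post cell.
    df["y"] = (2.0 * df["treated"] + 0.5 * df["post"]
               + 1.5 * df["treated"] * df["post"] + rng.normal(0, 1, n))

    # Spec drafted by the LLM, reviewed by a human before it runs.
    llm_drafted_spec = "y ~ treated + post + treated:post"

    did = smf.ols(llm_drafted_spec, data=df).fit()
    print(did.params["treated:post"])  # the causal estimate comes from the estimator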

This is the architecture emerging in urban causal agents and IJCAI's LLM-for-causal-discovery work: LLMs sit on top of proper causal machinery. They're imperfect expert systems, not oracles.


Where LLMs actually help

The research points to four roles where LLMs create real value in causal workflows:

1. Hypothesis generator and critic

LLMs are good at proposing plausible treatments, outcomes, and confounders from text. Give them a problem domain and they'll generate candidate causal graphs, surface variables you missed, and poke holes in your assumptions.

This is brainstorming, not truth. But it's useful brainstorming.
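
A sketch of what useful-but-untrusted brainstorming can look like in code, assuming a placeholder llm_complete call standing in for whatever model you use: the proposal comes back as structured edges and gets sanity-checked before anyone treats it as a graph.

    # Treat the LLM's proposal as a candidate graph to be checked, not an answer.
    # `llm_complete` is a placeholder for your model call.
    import json
    import networkx as nx

    def propose_candidate_dag(llm_complete, domain_description, variables):
        prompt = (
            f"Domain: {domain_description}\n"
            f"Variables: {variables}\n"
            'Propose plausible causal edges as JSON: {"edges": [["cause", "effect"], ...]}'
        )
        edges = json.loads(llm_complete(prompt))["edges"]

        # Reject hallucinated variables and cyclic proposals outright.
        known = set(variables)
        edges = [(a, b) for a, b in edges if a in known and b in known]
        g = nx.DiGraph(edges)
        if not nx.is_directed_acyclic_graph(g):
            raise ValueError("LLM proposed a cyclic graph; needs human review")
        return g  # a hypothesis to critique and test, not a conclusion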

2. Data wrangler

Messy data is the norm. LLMs can map chaotic schemas and documentation into clean variable definitions, treatment flags, and time windows suitable for causal estimators. They're good at the translation layer between "what the database calls things" and "what the causal model needs."
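
A rough sketch of that translation layer, again with a placeholder llm_complete and a pandas DataFrame: the LLM proposes the role mapping, and the mapping is rejected if it references columns that don't exist in the data.

    # The LLM maps raw column names to the roles the estimator needs;
    # the mapping is validated against the actual data before anything runs.
    # `llm_complete` is a placeholder for your model call.
    import json
    import pandas as pd

    def map_schema_to_roles(llm_complete, df: pd.DataFrame):
        prompt = (
            f"Columns: {list(df.columns)}\n"
            "Return JSON mapping roles to column names: "
            '{"treatment": ..., "outcome": ..., "confounders": [...], "time": ...}'
        )
        roles = json.loads(llm_complete(prompt))

        referenced = [roles["treatment"], roles["outcome"], roles["time"], *roles["confounders"]]
        missing = [c for c in referenced if c not in df.columns]
        if missing:
            raise ValueError(f"LLM referenced nonexistent columns: {missing}")
        return roles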

3. Method router

Given data structure and constraints, LLMs can suggest candidate identification strategies — "this looks like a diff-in-diff setup" or "you might need an instrument here" — and generate boilerplate code. Not to be blindly trusted, but a useful starting point.
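
A toy illustration of the routing logic (the real decision hinges on assumption checks, not just data shape, so treat this as a starting point in exactly the same way):

    # A toy router: given coarse facts about the data, suggest a candidate
    # identification strategy. A suggestion to start from, not a verdict.
    def suggest_identification_strategy(has_panel, has_pre_period, has_instrument,
                                        has_cutoff_rule, randomized):
        if randomized:
            return "randomized experiment: difference in means / ANCOVA"
        if has_panel and has_pre_period:
            return "difference-in-differences (check parallel trends)"
        if has_cutoff_rule:
            return "regression discontinuity (check manipulation at the cutoff)"
        if has_instrument:
            return "instrumental variables (argue the exclusion restriction)"
        return "selection-on-observables: matching / doubly robust ML, with sensitivity analysis"

    print(suggest_identification_strategy(True, True, False, False, False))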

4. Explanation and communication layer

This might be where they add the most value. Causal graphs and regression outputs are hard to explain to operators, policymakers, and business users. LLMs can turn technical outputs into narratives — grounded in the actual causal model, not hallucinated from vibes.
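
One way to keep the narrative grounded, sketched with a placeholder llm_complete and hypothetical result fields: assemble the prompt from the estimator's actual outputs and instruct the model to use only what it was given.

    # The narrative is generated from the estimator's actual outputs, passed in
    # as structured context, so the LLM explains numbers it was handed rather
    # than inventing them. `llm_complete` and the result keys are illustrative.
    def explain_result(llm_complete, result):
        context = (
            f"Method: {result['method']}\n"
            f"Effect estimate: {result['estimate']:.3f} "
            f"(95% CI {result['ci_low']:.3f} to {result['ci_high']:.3f})\n"
            f"Assumed graph edges: {result['graph_edges']}\n"
            f"Key assumptions: {result['assumptions']}\n"
        )
        prompt = (
            "Explain the following causal analysis to a non-technical stakeholder. "
            "Use only the numbers and assumptions provided.\n\n" + context
        )
        return llm_complete(prompt)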

The pattern: high-leverage, low-authority roles. Productivity gains without handing the LLM epistemic control over causal conclusions.


The architecture: neuro-symbolic causal agents

Recent work is converging on hybrid designs that explicitly separate what the LLM does from what proper causal models do.

Observability agents pair LLMs with causal graphs and abductive inference engines. The agent asks counterfactual questions of the graph — "if service X had not failed, would these logs change?" — rather than hallucinating root causes.

UrbanCIA-style systems decompose the pipeline into specialized agents (hypothesis, data, experiment, validator). Only some are LLMs. Causal validity is enforced by a separate validator phase running standard estimators.

"Bridging LLMs and causal world models" advocates learning a separate causal world model — from structured data or causal representation learning — that the LLM queries for planning. The world model isn't inside the LLM. It's external, explicit, auditable.

For deployment, this means: budget engineering effort for a real causal layer (graphs, estimators, simulators) and treat the LLM as an adaptive front-end to that layer.
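
A minimal sketch of what "external, explicit, auditable" can mean, using a toy structural causal model with made-up mechanisms: the agent answers intervention questions by calling the world model's do-style interface, not by imagining outcomes.

    # A tiny external causal world model: structural equations the agent can
    # intervene on explicitly, instead of asking the LLM to guess what happens.
    import numpy as np

    class TinySCM:
        """Toy SCM: demand depends on price and season; revenue on both."""
        def __init__(self, seed=0):
            self.rng = np.random.default_rng(seed)

        def simulate(self, n=10_000, do_price=None):
            season = self.rng.uniform(0, 1, n)
            price = 10 + 2 * season + self.rng.normal(0, 1, n)
            if do_price is not None:          # the do() operator: override the mechanism
                price = np.full(n, do_price)
            demand = 100 - 3 * price + 20 * season + self.rng.normal(0, 5, n)
            return (price * demand).mean()    # expected revenue

    scm = TinySCM()
    baseline = scm.simulate()
    intervened = scm.simulate(do_price=12)
    print(intervened - baseline)  # effect of the intervention, owned by the world model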


Guardrails are non-optional

Benchmarks show LLMs can look strong on familiar causal questions and collapse when graphs are larger, data is textual, or interventions are more complex. If your agent makes operational decisions, you need:

Mode separation. Clear distinction between "speculation" (brainstorming hypotheses) and "commitment" (driving an action based on an estimate). Different prompts, different UI, different accountability.

Red-teaming and counterfactual tests. Deliberately perturb assumed causal relations in a simulator or SCM. Check whether the agent's recommendations change in the right direction. If you flip a treatment effect and the agent doesn't notice, you have a problem.
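
A sketch of such a test, with a simple stand-in for the agent's estimation pipeline: flip the effect in the simulator and check that the recommendation flips with it.

    # Red-team check: flip the sign of the treatment effect in the simulator
    # and verify the agent's recommendation flips too. The "agent" here is a
    # stand-in estimate; in practice it would be the full pipeline under test.
    import numpy as np

    def make_simulator(true_effect, seed=0):
        rng = np.random.default_rng(seed)
        def simulate(n=5000):
            t = rng.integers(0, 2, n)
            y = true_effect * t + rng.normal(0, 1, n)
            return t, y
        return simulate

    def agent_recommendation(simulate):
        t, y = simulate()
        effect = y[t == 1].mean() - y[t == 0].mean()
        return "roll out" if effect > 0 else "hold"

    assert agent_recommendation(make_simulator(+1.0)) == "roll out"
    assert agent_recommendation(make_simulator(-1.0)) == "hold"  # fails if the agent ignores the causal world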

Provenance and traceability. Force the agent to output the causal graph and the estimator used, not just a natural-language claim about causality. Humans or other tools need to audit it.
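
One lightweight way to enforce this, sketched as a hypothetical record type the agent must fill in before any claim leaves the system:

    # Make the agent emit an auditable record, not just a sentence.
    from dataclasses import dataclass, field

    @dataclass
    class CausalClaimRecord:
        claim: str                       # e.g. "promo lifted weekly sales"
        graph_edges: list                # e.g. [("promo", "sales"), ("season", "sales")]
        estimator: str                   # e.g. "difference-in-differences"
        estimate: float
        ci: tuple
        data_window: str                 # e.g. "2024-01-01 to 2024-06-30"
        assumptions: list = field(default_factory=list)

    # Downstream tools (and humans) review the record before any action fires.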

Without this, agent behavior will be driven by whatever correlations the LLM finds linguistically plausible. That's exactly what you don't want when interventions are costly.


Concrete heuristics

If you're building agents for retail, operations, or policy-adjacent work:

Mine domain knowledge, then validate externally. Use agents to extract candidate mechanisms and variables from papers, internal docs, and logs. Feed those into explicit causal estimators for uplift, pricing, or policy effects. The LLM proposes; the estimator disposes.

Explain, don't decide. In dashboards or incident systems, let the agent explain causal results and suggest next experiments. Require backing from a causal model before any automated action — price change, rollout, remediation.

Separate the simulator from the policy. For world-model-style agents (simulating shopper behavior, policy effects, etc.), the simulator is one thing; the LLM policy/explainer is another. The LLM operates in the simulated causal world. It doesn't define it.
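
A sketch of that boundary, with a made-up elasticity model and a placeholder llm_propose_action: the simulator owns the causal assumptions, and the policy can only act through it.

    # The simulator owns the causal assumptions; the LLM-backed policy can only
    # propose actions and observe simulated outcomes, never redefine the world.
    # `llm_propose_action` is a placeholder for your model call.
    class ShopperSimulator:
        """Explicit, auditable model of how a price change moves demand."""
        def __init__(self, base_demand=100.0, price_elasticity=-3.0):
            self.base_demand = base_demand
            self.price_elasticity = price_elasticity

        def simulate(self, price_change_pct):
            demand_change_pct = self.price_elasticity * price_change_pct
            return self.base_demand * (1 + demand_change_pct / 100)

    def run_episode(llm_propose_action, simulator):
        action = llm_propose_action("Propose a price change in percent.")
        outcome = simulator.simulate(float(action))  # assumes the policy returns a number
        return action, outcome  # the LLM proposes and explains; the simulator decides what happens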


The bottom line

LLMs become trustworthy for causal work when you give them a real causal backbone and demote them from "oracle of causality" to "causality-aware operator interface."

They're not going to replace your econometrician or your causal inference pipeline. But they can make that pipeline faster to build, easier to use, and more accessible to people who aren't statisticians.

That's not nothing. It's just not magic.


Sources: