AI and other ramblings

Can Transformers Learn Causality? Part 2: The Coherence Illusion, or Why LLMs Look Smarter Than They Are

Last time I covered two papers showing that transformers can learn causal structure, both theoretically and empirically.

If that were the whole story, we could relax. But a third paper complicates the picture significantly.


The Myhill-Nerode Test

Vafa, Chen, Rambachan, Kleinberg, and Mullainathan asked a different question in "Evaluating the World Model Implicit in a Generative Model."

Instead of asking "can the model do task X?", they asked: does the model have a coherent world model that would let it do task X and all related tasks?

They formalized this using the Myhill-Nerode theorem from automata theory. The intuition: if a model truly understands a domain, it should recognize when two different sequences lead to the same underlying state. Its behavior should be consistent across equivalent situations.

Think of it like this: if you understand chess, you know that different move sequences can reach the same board position. A model that "understands" chess should treat equivalent positions equivalently.
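
To make that concrete, here's a rough sketch of the "compression" half of such a check: do two histories that reach the same underlying state get the same predicted continuations? This is my illustration of the idea, not the paper's code; true_state and model_next_tokens are hypothetical stand-ins for a ground-truth simulator and a wrapper around whatever model you're evaluating.

```python
# Sketch of the "compression" half of a Myhill-Nerode-style coherence
# check, in the spirit of Vafa et al. -- not their actual code.
# `true_state` and `model_next_tokens` are hypothetical stand-ins for a
# ground-truth simulator and a wrapper around the model being evaluated.

from itertools import combinations
from typing import Callable, Hashable, Sequence, Set


def compression_error(
    prefixes: Sequence[Sequence[str]],
    true_state: Callable[[Sequence[str]], Hashable],
    model_next_tokens: Callable[[Sequence[str]], Set[str]],
) -> float:
    """Fraction of equivalent prefix pairs the model treats differently.

    Two prefixes are equivalent if they reach the same underlying state;
    a coherent world model should then predict the same set of valid
    continuations for both.
    """
    errors = tested = 0
    for a, b in combinations(prefixes, 2):
        if true_state(a) == true_state(b):
            tested += 1
            if model_next_tokens(a) != model_next_tokens(b):
                errors += 1
    return errors / tested if tested else 0.0


if __name__ == "__main__":
    # Toy world: the state is the running total of +1/-1 tokens, so many
    # different orderings collapse to the same state.
    def true_state(prefix):
        return sum(1 if t == "+1" else -1 for t in prefix)

    # A deliberately incoherent "model" that only looks at the last token.
    def model_next_tokens(prefix):
        return {"+1"} if (not prefix or prefix[-1] == "-1") else {"+1", "-1"}

    histories = [("+1", "-1"), ("-1", "+1"), ("+1", "+1", "-1", "-1")]
    print(compression_error(histories, true_state, model_next_tokens))  # ~0.67
```

The full test in the paper also goes the other way: sequences that reach different states should be kept distinct, not collapsed together.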


The Sobering Results

The findings were stark: "their world models are far less coherent than they appear."

Models that performed well on standard benchmarks fell apart when tested on slight variations. They'd succeed on one version of a task and fail on a logically equivalent reformulation.

The apparent competence was masking underlying incoherence.

This isn't about models being "wrong" — they often get the right answer. It's about how they get there. A coherent world model would generalize smoothly. What the researchers found was brittleness: success on A, failure on A', even when A and A' are logically identical.
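
If you want to poke at this yourself, a crude probe looks something like the sketch below. ask_model is a placeholder for whatever interface you use to query a model, and exact string matching is a blunt stand-in for real answer grading; the point is only to show the shape of the check.

```python
# Hypothetical probe for the "success on A, failure on A'" pattern.
# `ask_model` is a placeholder for whatever client you use to query an
# LLM; exact string matching is a deliberately crude stand-in for real
# answer grading.

from typing import Callable, List, Tuple


def consistency_rate(
    equivalent_pairs: List[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of logically equivalent prompt pairs that get the same answer."""
    if not equivalent_pairs:
        return 1.0
    agree = sum(
        ask_model(a).strip().lower() == ask_model(b).strip().lower()
        for a, b in equivalent_pairs
    )
    return agree / len(equivalent_pairs)


# Example pairs: same question, two phrasings or two equivalent histories,
# same expected answer.
pairs = [
    ("Is 17 a prime number? Answer yes or no.",
     "Is 17 divisible only by 1 and itself? Answer yes or no."),
    ("After the moves e4, e5, Nf3, whose turn is it? Answer white or black.",
     "After the moves Nf3, e5, e4, whose turn is it? Answer white or black."),
]
# consistency_rate(pairs, ask_model) would then report how often the two
# phrasings agree.
```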


Reconciling the Evidence

How do we square this with the papers from Part 1?

Here's the synthesis:

Transformers can learn causal structure in controlled settings. When the training distribution is well-specified and the task is clean, gradient descent finds solutions that encode genuine structure.

But "can learn" isn't "reliably learns." Real-world training data is messy. The causal structure is implicit. The model has to infer it from surface patterns — and might infer something that works in training but breaks in deployment.

And even when they learn something, it's fragile. Apparent competence can mask shallow understanding. The model might get the right answer for the wrong reasons — reasons that won't generalize.

This isn't a contradiction. It's a capability with caveats.


The Philosophical Question

What does it even mean for a model to "understand" causality?

One view: understanding is behavior. If the model makes correct predictions and generalizes appropriately, it understands. Full stop.

Another view: understanding requires coherence. A consistent internal representation. The ability to explain why, not just predict what.

The research suggests transformers are closer to the first than the second. They exhibit understanding-like behavior in constrained settings. But the behavior is more brittle, more context-dependent, more superficial than we might hope.


Next up: Part 3 — What this means for deploying LLMs in the real world

Paper: Vafa, Chen, Rambachan, Kleinberg & Mullainathan, "Evaluating the World Model Implicit in a Generative Model"