Can Transformers Learn Causality? Part 1: The Case For
Last time I wrote about whether transformers can escape the Lucas Critique — the idea that models trained on historical correlations fail when policy changes the underlying system.
That post raised a deeper question: Can transformers learn causal structure at all, or are they just very good at pattern matching?
Two recent papers make a surprisingly strong case that yes, they can.
The Theoretical Argument
Nichani, Damian, and Lee tackled this mathematically in "How Transformers Learn Causal Structure with Gradient Descent."
Their setup: train a simplified two-layer transformer on sequences with latent causal structure. Then prove what the model actually learns.
The key finding: the gradient of the attention matrix encodes the mutual information between tokens, and by the data processing inequality, mutual information is largest along direct edges of the causal graph. Gradient descent therefore drives attention toward exactly those edges.
In plain English: when you train a transformer with gradient descent, the attention mechanism doesn't just learn correlations. It learns which variables are causally connected to which.
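The data-processing-inequality step is worth making concrete. Here is a toy sketch of the intuition (my own illustration, not the paper's construction): on a causal chain X0 → X1 → X2, empirical mutual information is strictly larger across the direct edge (X0, X1) than across the indirect pair (X0, X2), so ranking pairs by mutual information recovers the edge.

```python
import math
import random

def mutual_info(pairs):
    """Empirical mutual information (in bits) between two discrete variables."""
    n = len(pairs)
    joint, px, py = {}, {}, {}
    for x, y in pairs:
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

random.seed(0)
flip = lambda v, p: v if random.random() > p else 1 - v

# Causal chain X0 -> X1 -> X2: each child copies its parent,
# flipped with probability 0.1.
samples = []
for _ in range(20000):
    x0 = random.randint(0, 1)
    x1 = flip(x0, 0.1)
    x2 = flip(x1, 0.1)
    samples.append((x0, x1, x2))

mi_01 = mutual_info([(s[0], s[1]) for s in samples])
mi_02 = mutual_info([(s[0], s[2]) for s in samples])

# Data processing inequality: I(X0; X2) <= I(X0; X1).
# The direct edge carries more information than the two-hop path.
print(mi_01, mi_02)
```

An attention head that allocates weight where mutual information is highest ends up attending along the causal edges, which is the mechanism the proof formalizes.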
As a special case, they showed that transformers learning from Markov chains develop "induction heads" — a mechanism that notices that token A was followed by token B earlier in the sequence, and predicts that the next occurrence of A will be followed by B again. This isn't memorization. It's learning the generative structure.
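The induction-head behavior is simple enough to write out directly. A minimal sketch (my own illustration of the lookup, not the paper's transformer): to predict the next token, find the most recent earlier occurrence of the current token and emit whatever followed it.

```python
def induction_predict(tokens):
    """Predict the next token the way an induction head does:
    scan backwards for the most recent earlier occurrence of the
    last token, and predict the token that followed it."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence to copy from

# Having seen "A B" earlier in the sequence, the head predicts
# that B follows the next A.
print(induction_predict(["A", "B", "C", "A"]))  # returns "B"
```

The point of the theory is that this circuit is not hand-built: it emerges from gradient descent on Markov-chain data.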
The Empirical Evidence
Garg, Tsipras, Liang, and Valiant asked a simpler question in "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes."
They trained transformers from scratch on various function classes — linear functions, sparse linear functions, two-layer neural networks, decision trees — and tested whether the models could learn new instances purely from in-context examples.
The results: on linear functions, transformers matched the optimal least squares estimator in-context, and on the other classes they matched or beat strong task-specific baselines.
Even more impressive: the models worked under distribution shift. Train on one distribution, test on another, and performance held. This suggests something more robust than curve-fitting.
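To make the baseline concrete, here is what "optimal least squares" means for the in-context linear case (a sketch with made-up numbers, not the paper's code): fit the weight from the prompt's (x, y) examples, then predict at the query point. Matching this is the best any estimator can do from the context alone.

```python
def least_squares_predict(context, x_query):
    """In-context linear regression baseline: fit y = w * x by
    least squares on the prompt's (x, y) examples, then predict
    at the query point."""
    sxy = sum(x * y for x, y in context)
    sxx = sum(x * x for x, _ in context)
    w_hat = sxy / sxx
    return w_hat * x_query

# Hidden function y = 3x; the estimator recovers w = 3 from the
# in-context examples and extrapolates to the query.
context = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]
print(least_squares_predict(context, 4.0))  # 12.0
```

The striking part is that a transformer, given the same (x, y) pairs as a prompt, produces essentially this prediction — without ever being told the function class.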
What This Means
If these papers were the whole story, we could declare victory. Transformers can learn causal structure. Theory proves it's possible. Experiments show it happens.
But they're not the whole story.
Next time I'll cover the paper that complicates this picture — research showing that even when transformers appear to understand, their world models are "far less coherent than they appear."
The answer to "can transformers learn causality?" turns out to be: yes, but with caveats that matter.
Papers:
- Nichani, Damian, and Lee, "How Transformers Learn Causal Structure with Gradient Descent" (2024)
- Garg, Tsipras, Liang, and Valiant, "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes" (2022)