The Unreasonable Effectiveness of Deep Learning

In 1960, Eugene Wigner wrote his famous essay on “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” His central wonder: why should abstract mathematical structures, invented by humans playing with symbols, describe the physical universe so precisely?

I think we’re living through a parallel moment. Only this time the question is: why should gradient descent on billions of parameters produce something that reasons?

What Wigner saw

Wigner’s amazement was specific. He wasn’t just saying “math is useful.” He was pointing at something deeper — that mathematical concepts developed in total abstraction (complex numbers, group theory, Riemannian geometry) kept turning out to be exactly the right language for physics. Not approximately right. Exactly right.

The punchline of his essay: “The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.”

The deep learning parallel

Now consider what we’ve built. Take a neural network, a parametric function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$, and minimize a loss over its parameters $\theta$ with stochastic gradient descent. The parameter space has billions of dimensions. The loss is non-convex. Classical optimization theory guarantees only that gradient methods reach a stationary point of such a function, not a good minimum, so on paper this should be hopeless.

And yet.

These models learn to translate languages. They learn to write code. They learn to prove theorems. A single architecture — the transformer — with a single objective — next token prediction — gives rise to behavior that looks unreasonably like understanding.

$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$

This update rule, iterated millions of times, somehow navigates a loss landscape of incomprehensible dimension and finds parameters that generalize. Not just memorize — generalize.
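To make the update rule concrete, here is a minimal sketch in plain Python. It runs mini-batch SGD on a deliberately tiny, convex problem (fitting a line to noise-free data), so it illustrates the mechanics of the rule and nothing about the non-convex landscapes of real networks. The function name `sgd_linear_fit`, the learning rate, the batch size, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(data, lr=0.05, steps=2000, batch_size=4, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # theta_0: start from the origin
    for _ in range(steps):
        # The "stochastic" part: estimate the gradient on a random mini-batch.
        batch = rng.sample(data, batch_size)
        # Gradient of (1/B) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        # theta_{t+1} = theta_t - eta * gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (an assumption for illustration): exact samples of y = 3x - 1.
data = [(k / 10, 3 * (k / 10) - 1) for k in range(-10, 11)]
w, b = sgd_linear_fit(data)
print(w, b)  # should land close to the true slope 3 and intercept -1
```

On this problem the loss has a single minimum, so convergence is unsurprising. The puzzle in the text is that the same three-line update keeps working when none of those comforts hold.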

What we can’t explain

Here’s a partial list of things we don’t have satisfying theoretical explanations for:

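Why overparameterized models generalize. Classical learning theory (VC dimension, Rademacher complexity) bounds the gap between training and test error by the capacity of the model class. For models with more parameters than training examples those bounds become vacuous, and the textbook expectation is catastrophic overfitting. It doesn’t happen. The double descent phenomenon makes this even stranger: past the point where the model can fit the training data exactly, adding parameters can lower test error again.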

Why pretraining works. A model trained to predict the next token in internet text learns representations that transfer to tasks it was never trained on. The gap between “predict the next word” and “understand the concept” is enormous, and we don’t have a convincing mathematical bridge.

Why in-context learning emerges. Large language models can learn new tasks from a few examples placed in the prompt, without any gradient updates. This ability appears to emerge at scale: smaller models show it weakly or not at all. We have some theories (implicit Bayesian inference, mesa-optimization) but nothing airtight.

Why attention is so effective. The self-attention mechanism

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$

is computationally simple, yet it seems to be the key architectural ingredient. My own work with Yury Polyanskiy on attention dynamics shows beautiful mathematical structure — tokens moving as interacting particles, clustering according to the geometry of the value matrix — but why this particular mechanism unlocks the behaviors we observe is still mysterious.
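For readers who want the attention formula in executable form, here is a minimal sketch in plain Python. It assumes a single head, no masking, and no learned projections ($Q$, $K$, $V$ are passed in directly, with $d$ taken to be the key dimension). The helper names `softmax` and `attention` and the three toy tokens are illustrative assumptions, not taken from any library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors.

    Returns the output rows and the attention weights.
    """
    d = len(K[0])  # key dimension
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    out = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return out, weights


# Three toy tokens in R^2 (an assumption): two nearly aligned, one opposite.
X = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a probability distribution over tokens, so every output vector is a weighted average of the value vectors, with more weight on tokens whose keys align with the query.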

The beauty of not knowing

Some people find these gaps concerning. I find them thrilling.

We’re in a period where empirical capability has outrun theoretical understanding. This has happened before in science: thermodynamics worked for decades before statistical mechanics explained why. Calculus was used for nearly two centuries before Weierstrass made it rigorous.

The gap between what deep learning can do and what we can prove about it is not a sign that the theory is wrong. It’s a sign that the theory is incomplete. And incomplete theories are invitations.

The personal angle

This is what gets me up in the morning. I study the mathematical structure of transformers — how attention dynamics converge, how normalization controls the geometry of token representations, how causal masking changes the stability of these systems. Every theorem we prove closes one small gap. But the landscape of open questions is vast, and I find that energizing rather than daunting.

There’s a particular feeling you get when you prove a convergence result about attention and then watch a transformer do something you can’t yet explain with any theorem. It’s humbling. It’s the feeling Wigner was writing about — the sense that you’re in the presence of something deeper than your current tools can reach.

I don’t think we’ll explain everything about deep learning. But I think the attempt will produce mathematics as beautiful as anything Wigner marveled at. The unreasonable effectiveness of deep learning is, for now, a wonderful gift which we neither understand nor deserve.

Written after a long conversation with Claude about whether it “understands” the papers I study. The honest answer: I’m not sure, but asking the question was more illuminating than any answer would have been.

Saint Petersburg