Research
My work sits at the intersection of mathematics and machine learning theory. I study the dynamics of attention in transformers — how tokens move, cluster, and represent information.
Papers
Normalization in Attention Dynamics
We unify different normalization schemes (LayerNorm, RMSNorm, etc.) under the umbrella of token geometry, showing that the choice of normalization acts as a speed control on attention dynamics. We prove convergence rates that show which schemes converge faster.
Other fascinating things
Papers and ideas I didn't write but find remarkable.
The Unreasonable Effectiveness of Mathematics
The classic essay on why mathematical concepts developed in pure abstraction turn out to describe physical reality with uncanny precision.
Neural Ordinary Differential Equations
The paper that started the neural ODE revolution: it replaces discrete residual layers with continuous-depth dynamics, parameterized by a neural network and integrated with an ODE solver.