Learning Rate Scheduling ¹

May 28, 2026

Recent Advances in Neural Network Optimization for LLM Training

The optimization landscape for LLM training looks very different from two years ago. AdamW still dominates production runs, but a wave of research is eroding that dominance from multiple angles simultaneously: matrix-aware optimizers, horizon-free schedulers, a sharply revised understanding of µP, and communication-efficient distributed methods. This post synthesizes 18 recent papers across five interconnected fronts. The unifying thread is an active re-examination of long-held assumptions, from whether gradient geometry matters, to what µP is actually doing, to whether weight decay is a regularizer at all.

Learning Rate Scheduling 1