Nam Le

Recent Advances in Neural Network Optimization for LLM Training

The optimization landscape for LLM training looks very different from how it looked two years ago. AdamW still dominates production runs, but a wave of research is eroding that dominance from several directions at once: matrix-aware optimizers, horizon-free schedulers, a sharply revised understanding of µP, and communication-efficient distributed methods. This post synthesizes 18 recent papers across five interconnected fronts. The unifying thread is an active re-examination of long-held assumptions, from whether gradient geometry matters, to what µP is actually doing, to whether weight decay is a regularizer at all.