Recent Advances in Neural Network Optimization for LLM Training

Le, Nhut Nam
May 28, 2026

The optimization landscape for LLM training looks very different from two years ago. AdamW still dominates production runs, but a wave of research is eroding that dominance from multiple angles simultaneously: matrix-aware optimizers, horizon-free schedulers, a sharply revised understanding of µP, and communication-efficient distributed methods. This post synthesizes 22 recent papers and projects across five interconnected fronts.

The unifying thread is an active re-examination of long-held assumptions, from whether gradient geometry matters, to what µP is actually doing, to whether weight decay is a regularizer at all.

1. Muon and Non-Euclidean Optimizers #

Background #

Muon (MomentUm Orthogonalized by Newton-Schulz) orthogonalizes each momentum-averaged gradient with a Newton-Schulz iteration before applying it as a weight update. Rather than treating each parameter as an independent scalar (as Adam does), Muon recognizes that weight matrices have geometric structure and optimizes them accordingly, performing steepest descent under the spectral norm.

The core Newton-Schulz iteration, which runs stably in bfloat16 on tensor cores, is:

$$ X \leftarrow aX + b(XX^\top)X + c(XX^\top)^2 X $$

with coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. In PyTorch:

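import torch

def newtonschulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D tensor G with a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Frobenius norm <= 1 bounds the spectral norm by 1; out-of-place so G is never modified
    X = X / (X.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # work in the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # X <- aX + b(XX^T)X + c(XX^T)^2 X
    if transposed:
        X = X.T
    return X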
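
Because every term in the update shares the singular vectors of $X$, the iteration acts on each singular value independently through the scalar polynomial $\varphi(s) = as + bs^3 + cs^5$. The sketch below, in plain Python with an illustrative helper name (`ns_scalar_map` is not part of any library), applies that polynomial five times to a few starting values to show how both tiny and large singular values end up of order 1 rather than exactly 1.

```python
# Scalar view of the quintic Newton-Schulz step: each singular value s of X
# is mapped to a*s + b*s**3 + c*s**5, independently of the others.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def ns_scalar_map(s: float, steps: int = 5) -> float:
    """Apply the scalar polynomial `steps` times to one singular value."""
    for _ in range(steps):
        s = A_COEF * s + B_COEF * s**3 + C_COEF * s**5
    return s


# Starting values span two orders of magnitude (after normalization, all
# singular values of X lie in (0, 1]).
starts = [0.01, 0.1, 0.5, 1.0]
finals = [ns_scalar_map(s) for s in starts]
for s0, s5 in zip(starts, finals):
    print(f"{s0:5.2f} -> {s5:.3f}")
```

The values settle in a band around 1 instead of converging to it, which is consistent with coefficients tuned to inflate small singular values quickly rather than to reach an exact fixed point.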

A ready-to-use implementation lives at KellerJordan/Muon. Install via:

pip install git+https://github.com/KellerJordan/Muon

Muon is intended for hidden-layer matrix weights only. Embeddings, the output head, and scalar/vector parameters should still use AdamW:

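from muon import MuonWithAuxAdam

# Hidden-layer weight matrices go to Muon; everything else goes to the AdamW branch.
hidden_matrix_params = [
    p for n, p in model.blocks.named_parameters()
    if p.ndim >= 2 and "embed" not in n
]
muon_ids = {id(p) for p in hidden_matrix_params}
# Embeddings, the output head, and scalar/vector parameters, each listed exactly once.
adamw_params = [p for p in model.parameters() if id(p) not in muon_ids]

# Param-group interface as described in the repository README at the time of writing;
# check the installed version, and use SingleDeviceMuonWithAuxAdam outside torch.distributed.
param_groups = [
    dict(params=hidden_matrix_params, use_muon=True, lr=0.02),
    dict(params=adamw_params, use_muon=False, lr=3e-4, weight_decay=0.1),
]
optimizer = MuonWithAuxAdam(param_groups)
# Per the README, the Muon LR has built-in muP scaling, so it should need little retuning as you scale up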

Scaling Muon: the Moonlight result #

MoonshotAI’s Moonlight (an MoE with 3B activated and 16B total parameters, trained on 5.7T tokens) provides the strongest evidence yet that Muon scales to real LLM training (arXiv:2502.16982, GitHub). Two fixes are needed to make Muon work beyond small scale:

Weight decay: without it, weight and output RMS norms grow until they overflow bfloat16.
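Per-parameter update scale adjustment: Muon's update for an $A \times B$ weight matrix is scaled by $0.2\sqrt{\max(A, B)}$ so that its RMS matches that of a typical AdamW update, which lets AdamW learning rates and weight decay be reused.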

With these in place, scaling-law experiments indicate roughly 2× computational efficiency compared to AdamW at compute-optimal settings.
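
A minimal sketch of how the two fixes combine in a single update, assuming the rule reported in the Moonlight paper (the orthogonalized update is scaled by $0.2\sqrt{\max(A, B)}$ and decoupled weight decay is added). The function name and the nested-list representation are illustrative choices, not the repository's API, and `ortho_update` stands in for the Newton-Schulz output.

```python
import math


def moonlight_style_step(weight, ortho_update, lr, weight_decay):
    """Return W - lr * (0.2 * sqrt(max(A, B)) * O + weight_decay * W).

    `weight` and `ortho_update` are A x B matrices given as nested lists.
    """
    rows, cols = len(weight), len(weight[0])
    # An orthogonal-like update has RMS ~ 1/sqrt(max(A, B)); this factor
    # rescales it to an RMS of about 0.2 (the AdamW-like target assumed here).
    scale = 0.2 * math.sqrt(max(rows, cols))
    return [
        [
            w - lr * (scale * o + weight_decay * w)
            for w, o in zip(w_row, o_row)
        ]
        for w_row, o_row in zip(weight, ortho_update)
    ]


# Toy 2 x 2 example: the identity matrix is exactly orthogonal.
W = [[1.0, 0.0], [0.0, 1.0]]
O = [[1.0, 0.0], [0.0, 1.0]]
W_new = moonlight_style_step(W, O, lr=0.1, weight_decay=0.1)
```

Because the scale depends only on the matrix shape, it can be computed once per parameter rather than at every step.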

# Train a Qwen-like dense model with Muon (from Moonlight repo)
python3 examples/toy_train.py \
    --model qwen --optimizer muon \
    --dataset openwebtext-100k \
    --hidden_size 896 --lr 1e-3

A further efficiency variant is Flash-Muon, which reimplements the Newton-Schulz inner loop with a custom Triton kernel that exploits the symmetry of $XX^\top$: only one triangle of each symmetric product is computed and then mirrored, roughly halving the work for those matrix multiplications.
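
The symmetry being exploited is easy to see in plain Python. The sketch below is not Flash-Muon's Triton kernel; `gram_upper_mirrored` is a hypothetical helper that computes only the entries of $XX^\top$ on or above the diagonal and mirrors them, which needs $n(n+1)/2$ dot products instead of $n^2$.

```python
def gram_full(X):
    """Reference: all n*n dot products between rows of X."""
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]


def gram_upper_mirrored(X):
    """Compute X @ X.T from its upper triangle only, then mirror."""
    n = len(X)
    G = [[0.0] * n for _ in range(n)]
    dot_products = 0
    for i in range(n):
        for j in range(i, n):
            G[i][j] = sum(a * b for a, b in zip(X[i], X[j]))
            G[j][i] = G[i][j]  # symmetry: (X X^T)_ji == (X X^T)_ij
            dot_products += 1
    return G, dot_products


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, -1.0]]
G, count = gram_upper_mirrored(X)
```

Only the symmetric products benefit from this trick; the final multiplication by $X$ in each Newton-Schulz step is a general matrix product.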

Theoretical foundations #

Kovalev (2025) shows in Understanding Gradient Orthogonalization via Non-Euclidean Trust-Region Optimization that the orthogonalized gradient update can be interpreted as a first-order trust-region method where the trust-region is defined in terms of the matrix spectral norm. This framework unifies Muon with normalized SGD and signSGD with momentum.
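
To make the trust-region reading concrete, here is the standard calculation (a sketch in generic notation, not the paper's own). With the SVD $G = U \Sigma V^\top$ and a trust region of radius $\rho$ in the spectral norm, the step that minimizes the linearized loss is

$$ \Delta^\star = \arg\min_{\|\Delta\|_{\mathrm{op}} \le \rho} \langle G, \Delta \rangle = -\rho\, U V^\top, \qquad \langle G, \Delta^\star \rangle = -\rho \sum_i \sigma_i(G) = -\rho\, \|G\|_* $$

so the optimal step keeps the singular vectors of $G$ and sets every singular value to $\rho$, which is what Newton-Schulz approximates. Swapping the norm changes the step: the Frobenius ball gives $-\rho\, G / \|G\|_F$ (normalized SGD) and the entrywise $\ell_\infty$ ball gives $-\rho\, \mathrm{sign}(G)$ (signSGD).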

Pethick et al. (2025) propose Scion, a family of LMO-based algorithms that subsumes Muon, AdamW, and normalized SGD under a single framework (arXiv:2502.07529). By choosing an explicit norm for deep architectures, Scion also achieves hyperparameter transferability across model widths.

The Polar Express (Amsel et al., 2025) replaces Muon's fixed Newton-Schulz polynomial with a sequence of polynomials for the polar decomposition, each chosen by solving a minimax problem that minimizes worst-case error at that iteration. It converges faster than Newton-Schulz in both early and asymptotic stages, while remaining numerically stable in bfloat16.

Challenging the geometric narrative #

Despite the theoretical appeal, Shumaylov et al. (2026) mount a systematic challenge in Muon is Not That Special: Random or Inverted Spectra Work Just as Well. They introduce:

Freon: a family of optimizers based on Schatten (quasi-)norms, interpolating between SGD and Muon. The best-performing Schatten parameter for GPT-2 lies in the quasi-norm regime, which no LMO-based optimizer can represent.
Kaon: replaces Muon’s singular values with random noise, yet still matches Muon’s validation loss on GPT-2.

Their key insight: performance is primarily controlled by two local quantities, alignment (how well the update direction aligns with the gradient) and descent potential (step-size optimality). Muon succeeds by guaranteeing step-size optimality, not by tracking an ideal geometry.

Optimizer	Core mechanism	Key claim
Muon	Newton-Schulz orthogonalization	~2× efficiency over AdamW at compute-optimal
Scion	LMO over norm-ball	Unifies Muon/Adam; HP transferable across widths
Polar Express	Minimax polar decomposition	Faster convergence; bfloat16-safe
Freon / Kaon	Schatten quasi-norms / random SVs	Geometry is irrelevant; alignment drives performance

2. Learning Rate Scheduling #

Linear decay is provably optimal #

Defazio et al. (2023/2024) close a long-standing gap between theory and practice in Optimal Linear Decay Learning Rate Schedules and Further Refinements (arXiv:2310.07831). Under worst-case analysis, linear decay, setting $\eta_t \propto (1 - t/T)$, is the theoretically optimal schedule for a broad class of optimizers including SGD. Across 10 diverse deep learning benchmarks, it matches or outperforms the common default schedules, including cosine annealing.

$$ \eta_t = \eta_{\max} \cdot \left(1 - \frac{t}{T}\right) $$

# PyTorch built-in, the optimal default
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps
)

The WSD cooldown phase #

The Warmup-Stable-Decay (WSD) scheduler separates training into distinct phases ending in a sharp LR drop. Dremov et al. (2025) analyse the cooldown phase specifically in Training Dynamics of the Cooldown Stage in WSD, finding:

Cooldown shapes that balance exploration and exploitation consistently outperform purely exploratory or exploitative alternatives.
There is substantial sensitivity to AdamW’s $\beta_2$ parameter during cooldown, and higher $\beta_2$ values yield consistent improvements.
Loss-landscape visualisations support the “river valley” perspective: the cooldown follows a narrow valley in parameter space.
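
For reference, a WSD schedule can be written as a small function of the step. This is a minimal sketch with hypothetical parameter names and a linear cooldown chosen for simplicity; the cooldown shape is exactly the design choice the paper studies, so treat the final branch as the part to swap out.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate at a given step (linear cooldown)."""
    if step < warmup_steps:
        # Linear warmup up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Constant plateau.
        return max_lr
    # Cooldown: interpolate from max_lr down to min_lr, then hold min_lr.
    progress = (step - warmup_steps - stable_steps) / decay_steps
    progress = min(progress, 1.0)
    return max_lr + (min_lr - max_lr) * progress


schedule = [wsd_lr(s, 1e-3, warmup_steps=10, stable_steps=80, decay_steps=10)
            for s in range(100)]
```

In PyTorch, calling the same function with `max_lr=1.0` yields a multiplicative factor that can be passed to `torch.optim.lr_scheduler.LambdaLR`.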

Convex theory meets LLM practice #

Schaipp et al. (2025) show in The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training that schedules for large model training obey performance bounds from non-smooth convex optimisation. For the constant schedule with linear cooldown, the bound is:

$$ \bar{f}_T - f^* \leq \frac{\|x_0 - x^*\|^2}{2\eta T} + \frac{\eta}{2} \sum_{t=0}^{T-1} \sigma_t^2 $$

where the cooldown benefit appears explicitly through the absence of logarithmic terms. This enables principled LR transfer: exploiting the theory yields noticeable validation loss improvements for 124M and 210M Llama-type models when extending schedules for continued training.

Anytime schedules and weight averaging #

Meterez et al. (2026) prove in Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging (arXiv:2602.03702) that horizon-free (anytime) schedules exist for overparameterised linear regression, with weight averaging central to achieving minimax-optimal convergence. At 150M–300M params trained at 1–32× Chinchilla scale, a constant LR with weight averaging matches well-tuned cosine decay across the full training duration.

Weight averaging is a largely underutilised practical lever. It should be a default, not an afterthought.
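
Keeping an average costs one extra copy of the weights and a few lines of code. The sketch below keeps a uniform running average of parameters stored as a dict of floats; the class name is hypothetical, and in PyTorch the built-in `torch.optim.swa_utils.AveragedModel` plays a similar role.

```python
class RunningAverage:
    """Uniform running average of named scalar parameters."""

    def __init__(self):
        self.count = 0
        self.avg = {}

    def update(self, params):
        """Fold one snapshot in: avg += (params - avg) / count."""
        self.count += 1
        for name, value in params.items():
            prev = self.avg.get(name, 0.0)
            self.avg[name] = prev + (value - prev) / self.count


averager = RunningAverage()
for snapshot in ({"w": 1.0}, {"w": 2.0}, {"w": 6.0}):
    averager.update(snapshot)  # call after each optimizer step
```

Training continues on the raw weights; the averaged copy is the one to evaluate and checkpoint.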

ScheduleFree+ at LLM scale #

Defazio (2026) extends schedule-free learning to full LLM pretraining in ScheduleFree+: Scaling Learning-Rate-Free and Schedule-Free Learning to Large Language Models (arXiv:2605.19095). Practical fixes for large batch and model sizes enable ScheduleFree+ to achieve a 31% improvement over WSD schedules at 1000 tokens per parameter, while also providing a theoretical foundation for checkpoint merging during pretraining.

pip install schedulefree

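from schedulefree import AdamWScheduleFree

optimizer = AdamWScheduleFree(
    model.parameters(), lr=1e-3, warmup_steps=1000
)

# The optimizer keeps separate training and averaged weights, so it has to be
# told which mode it is in, alongside the usual model.train() / model.eval().
model.train()
optimizer.train()
# ... training steps ...

# Must switch to eval mode before evaluation or checkpointing
model.eval()
optimizer.eval()
val_loss = evaluate(model)
model.train()
optimizer.train()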

GitHub: facebookresearch/schedule_free
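
To see what the optimizer is doing underneath, here is a minimal sketch of the base Schedule-Free SGD recursion on a one-dimensional quadratic (the original Schedule-Free method, not the ScheduleFree+ variant, whose modifications are not reproduced here). Gradients are evaluated at an interpolation `y` of the fast iterate `z` and the running average `x`; `x` is the point you evaluate, which is why the library needs explicit `train()`/`eval()` switches. The objective and constants are illustrative.

```python
def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=500):
    """Schedule-Free SGD on a scalar problem; returns the averaged iterate x."""
    z = x = x0
    for t in range(steps):
        y = (1 - beta) * z + beta * x   # point where the gradient is taken
        z = z - lr * grad(y)            # plain SGD step on the fast iterate
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z         # uniform running average of z
    return x


# f(w) = 0.5 * (w - 3)^2 has gradient w - 3 and minimizer w = 3.
x_final = schedule_free_sgd(lambda w: w - 3.0, x0=0.0)
```

Note that the learning rate stays constant throughout: the averaging, not a decay schedule, is what damps the iterates.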

3. Hyperparameter Transfer and Scaling Laws (µP) #

Weight decay as the true driver of LR transfer #

The Maximal Update Parameterisation (µP) is widely used to transfer optimal learning rates from proxy models to large ones without re-tuning. Kosson et al. (2025/2026), accepted to ICLR 2026, provide a large-scale empirical refutation of the standard µP narrative in Weight Decay May Matter More than µP for Learning Rate Transfer in Practice.

Their finding: µP’s geometric alignment assumptions, which require alignment between a layer’s inputs, weights, and gradient updates, hold only briefly at the start of training. For the remainder, it is weight decay that stabilises update dynamics across widths and facilitates LR transfer. This implies µP’s scaling primarily acts as an implicit warmup, and can be largely replaced by modified warmup schedules.

Embedding layer LR as the key factor #

Kalra & Barkeshli (2026) provide complementary evidence in Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate, tracing µP’s advantage over standard parameterisation (SP) to a single factor: the embedding layer learning rate.

In SP, the embedding LR acts as a training bottleneck. Simply scaling it up in proportion to model width (by the width ratio between the target and proxy models), matching µP, eliminates most of the gap. Three quantitative metrics are used: quality of scaling law fit, robustness to extrapolation errors, and asymptotic loss penalty.

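import torch

# Simple fix that captures most of µP's benefit in SP.
# Assumes model, base_lr, model_width (d_model) and base_width (the proxy's d_model) already exist.
embed_lr_multiplier = model_width / base_width  # = d_model / d_model_proxy

embed_params = list(model.embed.parameters())
embed_ids = {id(p) for p in embed_params}
non_embed_params = [p for p in model.parameters() if id(p) not in embed_ids]

param_groups = [
    {"params": embed_params,     "lr": base_lr * embed_lr_multiplier},
    {"params": non_embed_params, "lr": base_lr},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.1)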

Open question: Kosson et al. argue µP acts as an implicit warmup; Kalra & Barkeshli argue it is about the embedding LR. Both contradict µP’s original geometric motivation. No consensus has emerged, and the practical implications differ significantly.

4. Normalization, Weight Decay, and Variance Reduction #

The end-of-training gradient spike #

Defazio (2025) identifies a subtle pathology in Why Gradients Rapidly Increase Near the End of Training: gradient norms spike sharply near the end of long LLM runs. The diagnosis is a three-way interaction between weight decay, normalisation layers, and the LR schedule.

When a layer is followed by normalisation, its scale becomes irrelevant to the forward pass, but weight decay continues shrinking the parameters. This creates an implicit competition between the optimizer’s effective update size and normalisation rescaling, causing gradient norms to grow unchecked as the LR decays.

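Fix: disable weight decay for AdamW-updated layers that are directly followed by normalisation (in a transformer, that covers the weights in every block). The split below is the usual starting point: it exempts normalisation gains, embeddings, and 1-D parameters, so extend the no_wd condition to whichever block weights you also want to exempt: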

no_wd, wd = [], []
for name, param in model.named_parameters():
    if "norm" in name or "embed" in name or param.ndim < 2:
        no_wd.append(param)
    else:
        wd.append(param)


optimizer = torch.optim.AdamW([
    {"params": wd,     "weight_decay": 0.1},
    {"params": no_wd,  "weight_decay": 0.0},
], lr=3e-4)

This simultaneously eliminates the spike and reduces loss throughout training. The analysis explains why weight decay should be disabled for AdamW-updated layers in architectures like modded-nanoGPT.
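
A softer alternative to switching decay off is to tie the weight-decay coefficient to the schedule. To the best of our reading, this is the correction the paper itself proposes (under the name AdamC): for layers followed by normalisation, use $\lambda_t = \lambda \cdot \gamma_t / \gamma_{\max}$, so the decay-to-LR ratio that sets the equilibrium stays constant as the LR $\gamma_t$ decays. The sketch below is an illustration under those assumptions, not the paper's code: `couple_weight_decay_to_lr` is a hypothetical helper that rewrites `weight_decay` in a list of param-group dicts carrying `lr` (such as `optimizer.param_groups` in PyTorch), and it expects each affected group to store its base value under a made-up `base_weight_decay` key that you would add yourself.

```python
def couple_weight_decay_to_lr(param_groups, max_lr):
    """Set weight_decay = base_weight_decay * lr / max_lr in each group.

    Groups without a `base_weight_decay` entry are left untouched.
    """
    for group in param_groups:
        if "base_weight_decay" in group:
            group["weight_decay"] = (
                group["base_weight_decay"] * group["lr"] / max_lr
            )


# Stand-in for optimizer.param_groups after the scheduler has halved the LR.
groups = [
    {"lr": 1.5e-4, "weight_decay": 0.1, "base_weight_decay": 0.1},  # normalised layers
    {"lr": 1.5e-4, "weight_decay": 0.0},                            # no decay
]
couple_weight_decay_to_lr(groups, max_lr=3e-4)
```

In a training loop this would run once per step, right after the scheduler updates the learning rate.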

Weight normalisation as an alternative #

Nemotron-Flash (Fu et al., 2025, NeurIPS 2025) investigates weight normalisation as a practical mechanism in small language models, finding that it enables more effective weight updates and improves final convergence. Weight normalisation sidesteps the weight-decay/normalisation interaction described above, though at the cost of slightly worse final loss compared to a well-tuned baseline.

MARS: variance reduction meets preconditioned gradients #

Despite decades of theoretical work, variance reduction has largely failed to yield practical gains in deep learning. Yuan et al. (2024/2025) attempt to change this in MARS: Unleashing the Power of Variance Reduction for Training Large Models, proposing a unified framework that reconciles AdamW, Lion, and Shampoo with variance reduction via a scaled stochastic recursive momentum technique.

GPT-2 training results look strong. However, the comprehensive benchmark by Semenov et al. (2025), Benchmarking Optimizers for Large Language Model Pretraining, a 73-page study covering 44 figures and 48 tables across standardised scenarios, reveals that MARS does not work well with small batch sizes, limiting its practical applicability in memory-constrained settings.

This underscores the danger of evaluating optimizers on a single benchmark setup: MARS looks excellent at the batch sizes used in the original paper and brittle elsewhere.

5. Distributed Training: DiLoCo and Its Descendants #

DiLoCo (Distributed Low-Communication training) runs AdamW as an inner optimizer for $H$ local steps on each worker (typically $H = 500$), then synchronises by applying Nesterov momentum to the pseudo-gradient: the net change in each worker's parameters over those inner steps, averaged across workers. Because workers communicate once per $H$ steps rather than every step, communication frequency drops by up to 500×.
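
The outer step fits in a few lines. The sketch below uses flat lists of floats and hypothetical names; it follows the Nesterov form used by `torch.optim.SGD` (buffer update, then gradient plus momentum times buffer) and omits the inner AdamW loop, which is represented only by the workers' end-of-round parameters.

```python
def diloco_outer_step(global_params, worker_params, momentum_buf,
                      outer_lr=0.7, momentum=0.9):
    """One DiLoCo-style outer update with Nesterov momentum.

    global_params: parameters every worker started the round from.
    worker_params: one parameter list per worker after H inner steps.
    momentum_buf:  outer momentum buffer (same length as global_params).
    """
    n_workers = len(worker_params)
    new_params, new_buf = [], []
    for i, theta in enumerate(global_params):
        # Pseudo-gradient: average over workers of (start - end).
        pseudo_grad = sum(theta - w[i] for w in worker_params) / n_workers
        buf = momentum * momentum_buf[i] + pseudo_grad
        step = pseudo_grad + momentum * buf  # Nesterov look-ahead
        new_params.append(theta - outer_lr * step)
        new_buf.append(buf)
    return new_params, new_buf


theta0 = [1.0, -2.0]
workers = [[0.8, -1.9], [0.6, -1.7]]  # two workers after their inner steps
theta1, buf1 = diloco_outer_step(theta0, workers, momentum_buf=[0.0, 0.0])
```

With a single entry in `worker_params`, the same function describes the single-worker case, where no communication is involved at all.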

OpenDiLoCo: the open-source foundation #

PrimeIntellect’s OpenDiLoCo provides a reproducible drop-in implementation, demonstrated training across two continents and three countries with 90–95% compute utilisation. It later served as the foundation for INTELLECT-1, a 10B-parameter model trained globally.

from functools import partial

import torch
from open_diloco.hivemind_diloco import DiLoCoOptimizer

# `dht` (the hivemind DHT used for peer discovery) and `model` are assumed to be set up already.
inner_optimizer = partial(torch.optim.AdamW, lr=4e-4)
outer_optimizer = partial(
    torch.optim.SGD, lr=0.7, momentum=0.9, nesterov=True
)

optimizer = DiLoCoOptimizer(
    dht=dht,
    params=model.parameters(),
    batch_size=512,
    num_inner_steps=500,  # sync every 500 steps, 500× fewer communications
    inner_optimizer=inner_optimizer,
    outer_optimizer=outer_optimizer,
)

Why DiLoCo works on a single node: SNOO #

Kallusky et al. (2025) show in SNOO: Step-K Nesterov Outer Optimizer that DiLoCo’s effectiveness, even on a single node, stems from applying Nesterov momentum to the pseudo-gradient. Their method isolates this as a standalone Lookahead variant. Results:

1.5–2.5× FLOPs efficiency gains up to $10^{23}$ training FLOPs.
Improvements increase with model size.
Compatible with both AdamW and Muon as inner optimizers.
Minimal memory overhead.

The single-worker DiLoCo achieves speedups of up to 6.32% in steps-to-loss over AdamW on a 160M Llama model.

Smoothing DiLoCo: Generalized Primal Averaging (GPA) #

Defazio et al. (2025/2026) propose GPA in Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (arXiv:2512.17131), which decouples DiLoCo’s interpolation constants to enable smooth iterate averaging at every step, replacing uniform averaging with exponential moving averaging.

GPA unifies single-worker DiLoCo and ScheduleFree within a single non-distributed framework. Speedups over AdamW in steps-to-target-loss:

Model	Speedup
Llama-160M	8.71%
Llama-1B	10.13%
Llama-8B	9.58%

Streaming DiLoCo: towards free distributed training #

Douillard et al. (2025) address the remaining bottleneck in Streaming DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch (arXiv:2501.18512): even with infrequent synchronisation, each sync exchanges all parameters simultaneously. Three fixes:

Streaming sync: synchronise only subsets of parameters at a time.
Overlapping communication: continue training during synchronisation.
Quantisation: reduce cross-worker data to fewer bits.

Together, required bandwidth drops by two orders of magnitude while maintaining comparable quality at billion-parameter scale.
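
A rough sketch of the first idea, under assumptions of our own: parameters are split into `num_fragments` groups, and fragment `f` synchronises once every `sync_every` steps, with its sync points shifted by an evenly spaced offset. The staggering rule and all names are illustrative rather than the paper's exact schedule; the point is that every fragment still syncs once per $H$ steps while the traffic at any single sync shrinks to one fragment.

```python
def fragments_to_sync(step, num_fragments, sync_every):
    """Return the fragment indices due for synchronisation at `step`."""
    stride = sync_every // num_fragments  # spacing between fragment syncs
    return [
        f for f in range(num_fragments)
        if step >= f * stride and (step - f * stride) % sync_every == 0
    ]


H, P = 100, 4
# Steps at which each fragment syncs during the first two rounds.
sync_steps = {
    f: [s for s in range(1, 2 * H + 1) if f in fragments_to_sync(s, P, H)]
    for f in range(P)
}
```

Overlapping communication and quantisation then act on each of these smaller per-fragment exchanges.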

Method	Setting	Key contribution	Gain
SNOO	Single-node	Nesterov momentum on pseudo-gradient	1.5–2.5× FLOP efficiency
GPA	Single-node	Smooth iterate averaging; unifies DiLoCo + SF	~9% steps-to-loss
Streaming DiLoCo	Distributed	Streaming sync + quantisation	~100× bandwidth reduction

6. Cross-Cutting Themes and Open Questions #

Several recurrent tensions emerge from reading these papers together.

Geometry vs. step-size calibration in Muon #

Kovalev, Pethick et al., and Amsel et al. offer geometric explanations for Muon’s success. Shumaylov et al. argue that geometry is practically irrelevant and step-size optimality is the true driver. Which narrative guides future research matters: geometry points toward more sophisticated matrix norms; the step-size interpretation suggests much simpler paths to similar gains.

What µP is actually doing #

Kosson et al. argue µP is primarily an implicit warmup mechanism. Kalra & Barkeshli argue it is essentially about the embedding layer LR. Both stand in contrast to µP’s original geometric motivation. The practical stakes are high: the warmup interpretation suggests µP can be discarded with a schedule change; the embedding LR interpretation suggests a single-line fix.

Weight decay as a multi-role hyperparameter #

Weight decay appears as a protagonist in three independent stories in this survey:

Defazio: source of end-of-training gradient spikes via interaction with normalisation.
Kosson et al.: the true driver of LR transfer, not µP geometry.
Kalra & Barkeshli: improves scaling law fits but hurts extrapolation robustness.

It is no longer tenable to treat weight decay as a simple regulariser with a sensible default. It must be understood per-layer and in interaction with your normalisation strategy.

DiLoCo as the practical distributed optimizer #

Despite a large body of research on distributed optimizers, DiLoCo and its derivatives appear to be the only methods that consistently add value beyond simply scaling the batch size. The finding that its benefits carry over to single-node settings (via SNOO and GPA) makes it a particularly important line of work for practitioners at all scales.

Practical Recommendations for 2026 #

Based on the convergence of evidence across these papers, for a new large training run consider:

Optimizer: Muon for hidden-layer matrix weights + AdamW for embeddings/head. The Moonlight scaling fixes (weight decay + update scale adjustment) become necessary beyond small scale.
Schedule: ScheduleFree+ or linear decay instead of cosine. If you need a fixed-horizon schedule, WSD with higher $\beta_2$ during cooldown.
Weight decay: Disable it for layers directly followed by normalisation to avoid end-of-training gradient spikes.
Outer optimizer: Wrap your training loop with a single-worker DiLoCo variant (SNOO or GPA), with no architectural changes. GPA reports roughly 9–10% fewer steps to target loss; SNOO reports 1.5–2.5× FLOP-efficiency gains.
µP alternatives: Before adopting full µP overhead, try increasing the embedding layer LR by a factor of $d_{\text{model}} / d_{\text{proxy}}$. This may reproduce most of the benefit.

None of these require fundamental architectural changes.

References #

#	Paper	Venue	Links
1	Jordan et al. (2024): Muon: An optimizer for hidden layers	n/a	blog · GitHub
2	Liu et al. (2025): Muon is Scalable for LLM Training (Moonlight)	n/a	arXiv:2502.16982 · GitHub
3	Kovalev (2025): Understanding Gradient Orthogonalization	n/a	n/a
4	Pethick et al. (2025): Training Deep Learning Models with Norm-Constrained LMOs (Scion)	n/a	arXiv:2502.07529
5	Amsel et al. (2025): The Polar Express	n/a	n/a
6	Shumaylov et al. (2026): Muon is Not That Special (Freon/Kaon)	n/a	n/a
7	Defazio et al. (2023): Optimal Linear Decay Learning Rate Schedules	n/a	arXiv:2310.07831
8	Dremov et al. (2025): Training Dynamics of the Cooldown Stage in WSD	n/a	n/a
9	Schaipp et al. (2025): Surprising Agreement Between Convex Theory and LR Scheduling	n/a	n/a
10	Meterez et al. (2026): Anytime Pretraining	n/a	arXiv:2602.03702
11	Defazio (2026): ScheduleFree+	n/a	arXiv:2605.19095 · GitHub
12	Kosson et al. (2026): Weight Decay May Matter More than µP	ICLR 2026	n/a
13	Kalra & Barkeshli (2026): Quantifying HP Transfer and Embedding LR	n/a	n/a
14	Defazio (2025): Why Gradients Rapidly Increase Near End of Training	n/a	n/a
15	Fu et al. (2025): Nemotron-Flash	NeurIPS 2025	n/a
16	Yuan et al. (2025): MARS	n/a	n/a
17	Semenov et al. (2025): Benchmarking Optimizers for LLM Pretraining	n/a	n/a
18	Kallusky et al. (2025): SNOO	n/a	n/a
19	Defazio et al. (2026): Smoothing DiLoCo with Primal Averaging (GPA)	n/a	arXiv:2512.17131
20	Douillard et al. (2025): Streaming DiLoCo	n/a	arXiv:2501.18512
21	Douillard et al. (2023/2024): DiLoCo (original)	n/a	arXiv:2311.08105
22	PrimeIntellect AI (2024): OpenDiLoCo	n/a	GitHub · blog
