<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Muon on Nam Le</title><link>https://blog.namln.org/en/tags/muon/</link><description>Recent content in Muon on Nam Le</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Thu, 28 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.namln.org/en/tags/muon/index.xml" rel="self" type="application/rss+xml"/><item><title>Recent Advances in Neural Network Optimization for LLM Training</title><link>https://blog.namln.org/en/posts/llm-optimization-2025-survey/</link><pubDate>Thu, 28 May 2026 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/posts/llm-optimization-2025-survey/</guid><description>&lt;p&gt;The optimization landscape for LLM training looks very different from two years
ago. AdamW still dominates production runs, but a wave of research is eroding
that dominance from multiple angles simultaneously: matrix-aware optimizers,
horizon-free schedulers, a sharply revised understanding of µP, and
communication-efficient distributed methods. This post synthesizes 18 recent
papers across five interconnected fronts.&lt;/p&gt;
&lt;p&gt;The unifying thread is an active re-examination of long-held assumptions, from
whether gradient geometry matters, to what µP is actually doing, to whether
weight decay is a regularizer at all.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="1-muon-and-non-euclidean-optimizers"&gt;
 1. Muon and Non-Euclidean Optimizers&lt;span class="heading__anchor"&gt; &lt;a href="#1-muon-and-non-euclidean-optimizers"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="background"&gt;
 Background&lt;span class="heading__anchor"&gt; &lt;a href="#background"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Muon&lt;/strong&gt; (&lt;em&gt;&lt;strong&gt;M&lt;/strong&gt;oment&lt;strong&gt;U&lt;/strong&gt;m &lt;strong&gt;O&lt;/strong&gt;rthogonalized by &lt;strong&gt;N&lt;/strong&gt;ewton-Schulz&lt;/em&gt;) orthogonalizes the
momentum-averaged gradient with a Newton-Schulz iteration before each weight
update. Rather than treating each parameter as an independent scalar (as Adam
does), Muon recognizes that weight matrices have geometric structure and
optimizes them accordingly, performing steepest descent under the &lt;strong&gt;spectral
norm&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The core Newton-Schulz iteration, which runs stably in &lt;code&gt;bfloat16&lt;/code&gt; on tensor
cores, is:&lt;/p&gt;
&lt;p&gt;$$
X \leftarrow aX + b(XX^\top)X + c(XX^\top)^2 X
$$&lt;/p&gt;
&lt;p&gt;with coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. In PyTorch:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;newtonschulz5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.4445&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;4.7750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0315&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;A ready-to-use implementation lives at
&lt;a href="https://github.com/KellerJordan/Muon"&gt;KellerJordan/Muon&lt;/a&gt;. Install via:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install git+https://github.com/KellerJordan/Muon&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;Muon is intended for hidden-layer matrix weights only. Embeddings, the output
head, and scalar/vector parameters should still use AdamW:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;muon&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MuonWithAuxAdam&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hidden_matrix_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;embed_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;scalar_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;head_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_head&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MuonWithAuxAdam&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Hidden-layer matrices are updated by Muon&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hidden_matrix_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_muon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="c1"&gt;# Everything else falls back to the auxiliary AdamW&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_params&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;scalar_params&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;head_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;         &lt;span class="n"&gt;use_muon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# The Muon LR is designed with built-in muP scaling, so it should need little retuning as you scale up&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="scaling-muon-the-moonlight-result"&gt;
 Scaling Muon: the Moonlight result&lt;span class="heading__anchor"&gt; &lt;a href="#scaling-muon-the-moonlight-result"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;MoonshotAI&amp;rsquo;s &lt;strong&gt;Moonlight&lt;/strong&gt; (an MoE model with 3B activated and 16B total parameters, trained on 5.7T tokens)
provides the strongest evidence yet that Muon scales to real LLM training
(&lt;a href="https://arxiv.org/abs/2502.16982"&gt;arXiv:2502.16982&lt;/a&gt;,
&lt;a href="https://github.com/MoonshotAI/Moonlight"&gt;GitHub&lt;/a&gt;). Two fixes are needed to
make Muon work beyond small scale:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Weight decay:&lt;/strong&gt; without it, weight and output RMS norms grow until they
overflow &lt;code&gt;bfloat16&lt;/code&gt;.&lt;/li&gt;
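&lt;li&gt;&lt;strong&gt;Per-parameter update scale adjustment:&lt;/strong&gt; a full-rank orthogonalized
update for an $A \times B$ matrix has RMS $\sqrt{1/\max(A, B)}$, so each update is
rescaled by $0.2\sqrt{\max(A, B)}$ to match the update RMS typical of AdamW.&lt;/li&gt;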
&lt;/ol&gt;
&lt;p&gt;With these in place, scaling-law experiments indicate roughly &lt;strong&gt;2× computational
efficiency&lt;/strong&gt; compared to AdamW at compute-optimal settings.&lt;/p&gt;
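&lt;p&gt;The two fixes can be sketched in a few lines of plain Python. This is a minimal illustration rather than the Moonlight implementation: the function names, the target RMS of 0.2, and the toy matrix are assumptions made for the example. It relies on one fact that is easy to check by hand: a full-rank orthogonalized $A \times B$ update has RMS $\sqrt{1/\max(A, B)}$, so multiplying it by a target value times $\sqrt{\max(A, B)}$ gives every matrix the same update RMS regardless of shape.&lt;/p&gt;

```python
import math


def update_rms(matrix):
    """Root-mean-square of all entries of a matrix given as a list of rows."""
    count = sum(len(row) for row in matrix)
    return math.sqrt(sum(v * v for row in matrix for v in row) / count)


def rms_matching_scale(rows, cols, target_rms=0.2):
    """Factor that brings an orthogonalized rows-by-cols update to target_rms.

    target_rms=0.2 is an assumed value for this sketch, not a tuned setting.
    """
    return target_rms * math.sqrt(max(rows, cols))


def apply_update(weight, ortho_update, lr, weight_decay, target_rms=0.2):
    """One step: decoupled weight decay plus the rescaled orthogonal update."""
    scale = rms_matching_scale(len(weight), len(weight[0]), target_rms)
    return [
        [w - lr * (scale * u + weight_decay * w) for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, ortho_update)
    ]


# A 2x4 matrix with orthonormal rows stands in for a Newton-Schulz output.
ortho = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
scaled = [[rms_matching_scale(2, 4) * v for v in row] for row in ortho]
print(update_rms(ortho))   # RMS before rescaling: sqrt(1/4)
print(update_rms(scaled))  # RMS after rescaling: close to the 0.2 target
```

&lt;p&gt;The weight-decay term inside the step is what keeps the weight norm bounded over long runs; setting it to zero in this sketch recovers the unregularized update.&lt;/p&gt;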

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Train a Qwen-like dense model with Muon (from Moonlight repo)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python3 examples/toy_train.py &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --model qwen --optimizer muon &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --dataset openwebtext-100k &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --hidden_size &lt;span class="m"&gt;896&lt;/span&gt; --lr 1e-3&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;A further efficiency variant is
&lt;a href="https://github.com/nil0x9/flash-muon"&gt;Flash-Muon&lt;/a&gt;, which reimplements the
Newton-Schulz inner loop with a custom Triton kernel. The kernel exploits the
symmetry of $XX^\top$ to compute only about half of that product.&lt;/p&gt;
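&lt;p&gt;To see where the saving comes from, note that $A = XX^\top$ is symmetric, so only the entries on or above the diagonal need to be computed. The plain-Python sketch below illustrates that idea only; it is not the Flash-Muon Triton kernel, and the function name is made up for the example.&lt;/p&gt;

```python
def gram_upper_then_mirror(x):
    """Compute A = X X^T from the upper triangle only, then mirror it.

    Returns the matrix and the number of row dot products evaluated.
    """
    n = len(x)
    a = [[0.0] * n for _ in range(n)]
    dot_products = 0
    for i in range(n):
        for j in range(i, n):  # j starts at i: the lower triangle is skipped
            a[i][j] = sum(p * q for p, q in zip(x[i], x[j]))
            a[j][i] = a[i][j]  # symmetry supplies the mirrored entry
            dot_products += 1
    return a, dot_products


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a, used = gram_upper_then_mirror(x)
print(a)     # same result as the full product X X^T
print(used)  # n(n+1)/2 dot products instead of n*n
```

&lt;p&gt;For an $n$-row matrix this evaluates $n(n+1)/2$ dot products rather than $n^2$, which approaches half as $n$ grows. The final product with $X$ in each Newton-Schulz step is not symmetric and does not benefit.&lt;/p&gt;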
&lt;h3 class="heading" id="theoretical-foundations"&gt;
 Theoretical foundations&lt;span class="heading__anchor"&gt; &lt;a href="#theoretical-foundations"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kovalev (2025)&lt;/strong&gt; shows in &lt;em&gt;Understanding Gradient Orthogonalization via
Non-Euclidean Trust-Region Optimization&lt;/em&gt; that the orthogonalized gradient update
can be interpreted as a first-order trust-region method where the trust-region is
defined in terms of the matrix spectral norm. This framework unifies Muon with
normalized SGD and signSGD with momentum.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pethick et al. (2025)&lt;/strong&gt; propose &lt;strong&gt;Scion&lt;/strong&gt;, a family of LMO-based algorithms
that subsumes Muon, AdamW, and normalized SGD under a single framework
(&lt;a href="https://arxiv.org/abs/2502.07529"&gt;arXiv:2502.07529&lt;/a&gt;). By choosing an explicit
norm for deep architectures, Scion also achieves hyperparameter transferability
across model widths.&lt;/p&gt;
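&lt;p&gt;The linear minimization oracle (LMO) view can be made concrete on plain vectors. Given a gradient $g$, the LMO returns the point of a unit norm ball that minimizes the inner product with $g$. The sketch below covers the two vector cases; the function names are illustrative and not taken from the Scion code. The spectral-norm case for matrices is the orthogonalization that Muon approximates with Newton-Schulz.&lt;/p&gt;

```python
import math


def lmo_l2(grad):
    """LMO over the unit Euclidean ball: the normalized SGD direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [-g / norm for g in grad]


def lmo_linf(grad):
    """LMO over the unit max-norm ball: the signSGD direction."""
    return [-math.copysign(1.0, g) if g != 0.0 else 0.0 for g in grad]


g = [3.0, -4.0]
print(lmo_l2(g))    # [-0.6, 0.8]
print(lmo_linf(g))  # [-1.0, 1.0]
```

&lt;p&gt;Swapping the norm changes the update rule while the surrounding loop stays the same, which is the sense in which these methods share one framework.&lt;/p&gt;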
&lt;p&gt;&lt;strong&gt;The Polar Express&lt;/strong&gt; (Amsel et al., 2025) replaces the fixed Newton-Schulz
polynomial with one chosen afresh at each iteration by solving a minimax problem
that minimizes the worst-case error. It converges faster than Newton-Schulz in
both early and asymptotic stages, while remaining numerically stable in &lt;code&gt;bfloat16&lt;/code&gt;.&lt;/p&gt;
&lt;h3 class="heading" id="challenging-the-geometric-narrative"&gt;
 Challenging the geometric narrative&lt;span class="heading__anchor"&gt; &lt;a href="#challenging-the-geometric-narrative"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite the theoretical appeal, &lt;strong&gt;Shumaylov et al. (2026)&lt;/strong&gt; mount a systematic
challenge in &lt;em&gt;Muon is Not That Special: Random or Inverted Spectra Work Just as
Well&lt;/em&gt;. They introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Freon:&lt;/strong&gt; a family of optimizers based on Schatten (quasi-)norms,
interpolating between SGD and Muon. The best-performing Schatten parameter for
GPT-2 lies in the &lt;em&gt;quasi-norm&lt;/em&gt; regime, which no LMO-based optimizer can
represent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kaon:&lt;/strong&gt; replaces Muon&amp;rsquo;s singular values with random noise, yet still
matches Muon&amp;rsquo;s validation loss on GPT-2.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Their key insight: performance is primarily controlled by two local quantities,
&lt;em&gt;alignment&lt;/em&gt; (how well the update direction aligns with the gradient) and &lt;em&gt;descent
potential&lt;/em&gt; (step-size optimality). Muon succeeds by guaranteeing step-size
optimality, not by tracking an ideal geometry.&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Optimizer&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Core mechanism&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Key claim&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Muon&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Newton-Schulz orthogonalization&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~2× efficiency over AdamW at compute-optimal&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Scion&lt;/td&gt;
					&lt;td style="text-align: left"&gt;LMO over norm-ball&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Unifies Muon/Adam; HP transferable across widths&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Polar Express&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Minimax polar decomposition&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Faster convergence; bfloat16-safe&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Freon / Kaon&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Schatten quasi-norms / random SVs&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Geometry is secondary; alignment and descent potential drive performance&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="2-learning-rate-scheduling"&gt;
 2. Learning Rate Scheduling&lt;span class="heading__anchor"&gt; &lt;a href="#2-learning-rate-scheduling"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="linear-decay-is-provably-optimal"&gt;
 Linear decay is provably optimal&lt;span class="heading__anchor"&gt; &lt;a href="#linear-decay-is-provably-optimal"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio et al. (2023/2024)&lt;/strong&gt; close a long-standing gap between theory and
practice in &lt;em&gt;Optimal Linear Decay Learning Rate Schedules and Further
Refinements&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2310.07831"&gt;arXiv:2310.07831&lt;/a&gt;). Under
worst-case analysis, &lt;strong&gt;linear decay&lt;/strong&gt;, setting $\eta_t \propto (1 - t/T)$, is
the theoretically optimal schedule for a broad class of optimizers including SGD.
Across 10 diverse benchmarks, it consistently outperforms cosine annealing.&lt;/p&gt;
&lt;p&gt;$$
\eta_t = \eta_{\max} \cdot \left(1 - \frac{t}{T}\right)
$$&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# PyTorch built-in: linear decay from the peak LR to zero&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LinearLR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_steps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="the-wsd-cooldown-phase"&gt;
 The WSD cooldown phase&lt;span class="heading__anchor"&gt; &lt;a href="#the-wsd-cooldown-phase"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;The Warmup-Stable-Decay (WSD) scheduler separates training into distinct phases
ending in a sharp LR drop. &lt;strong&gt;Dremov et al. (2025)&lt;/strong&gt; analyse the cooldown phase
specifically in &lt;em&gt;Training Dynamics of the Cooldown Stage in WSD&lt;/em&gt;, finding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cooldown shapes that balance exploration and exploitation consistently
outperform purely exploratory or exploitative alternatives.&lt;/li&gt;
&lt;li&gt;There is substantial sensitivity to AdamW&amp;rsquo;s $\beta_2$ parameter during
cooldown, and &lt;strong&gt;higher $\beta_2$ values yield consistent improvements&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Loss-landscape visualisations support the &amp;ldquo;river valley&amp;rdquo; perspective: the
cooldown follows a narrow valley in parameter space.&lt;/li&gt;
&lt;/ul&gt;
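&lt;p&gt;For reference, a WSD schedule can be written as a small function. This is a sketch under stated assumptions: the phase lengths, the linear cooldown shape, and the function name are choices made for the example, and the paper compares several cooldown shapes rather than prescribing this one.&lt;/p&gt;

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, cooldown_steps):
    """Learning rate at `step` for a warmup-stable-decay (WSD) schedule."""
    if warmup_steps > step:
        # Phase 1: linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    cooldown_start = warmup_steps + stable_steps
    if cooldown_start >= step:
        # Phase 2: hold the peak learning rate.
        return peak_lr
    if cooldown_steps == 0:
        return 0.0
    # Phase 3: linear cooldown from peak_lr down to 0 (assumed shape).
    progress = min(1.0, (step - cooldown_start) / cooldown_steps)
    return peak_lr * (1.0 - progress)


schedule = [wsd_lr(s, 1.0, 10, 80, 10) for s in (0, 5, 50, 95, 100)]
print(schedule)  # [0.0, 0.5, 1.0, 0.5, 0.0]
```

&lt;p&gt;Because the stable phase does not depend on the total step count, a run can be extended by pushing the cooldown start later, which is the main practical appeal of WSD over cosine decay.&lt;/p&gt;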
&lt;h3 class="heading" id="convex-theory-meets-llm-practice"&gt;
 Convex theory meets LLM practice&lt;span class="heading__anchor"&gt; &lt;a href="#convex-theory-meets-llm-practice"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Schaipp et al. (2025)&lt;/strong&gt; show in &lt;em&gt;The Surprising Agreement Between Convex
Optimization Theory and Learning-Rate Scheduling for Large Model Training&lt;/em&gt; that
schedules for large model training obey performance bounds from non-smooth convex
optimisation. For the constant schedule with linear cooldown, the bound is:&lt;/p&gt;
&lt;p&gt;$$
\bar{f}_T - f^* \leq \frac{\lVert x_0 - x^* \rVert^2}{2\eta T} + \frac{\eta}{2} \sum_{t=0}^{T-1} \sigma_t^2
$$&lt;/p&gt;
&lt;p&gt;where the cooldown benefit appears explicitly through the absence of logarithmic
terms. This enables &lt;strong&gt;principled LR transfer&lt;/strong&gt;: exploiting the theory yields
noticeable validation loss improvements for 124M and 210M Llama-type models when
extending schedules for continued training.&lt;/p&gt;
&lt;h3 class="heading" id="anytime-schedules-and-weight-averaging"&gt;
 Anytime schedules and weight averaging&lt;span class="heading__anchor"&gt; &lt;a href="#anytime-schedules-and-weight-averaging"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Meterez et al. (2026)&lt;/strong&gt; prove in &lt;em&gt;Anytime Pretraining: Horizon-Free
Learning-Rate Schedules with Weight Averaging&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2602.03702"&gt;arXiv:2602.03702&lt;/a&gt;) that horizon-free (anytime)
schedules exist for overparameterised linear regression, with &lt;strong&gt;weight averaging&lt;/strong&gt;
central to achieving minimax-optimal convergence. At 150M–300M params trained at
1–32× Chinchilla scale, a constant LR with weight averaging matches well-tuned
cosine decay across the full training duration.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Weight averaging is a largely underutilised practical lever. It should be a
default, not an afterthought.&lt;/p&gt;
&lt;/blockquote&gt;
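&lt;p&gt;As a concrete picture of the lever being discussed, a uniform running average of the weights needs one extra copy of the parameters and a single update per step. The class below is a hypothetical sketch in plain Python; the averaging scheme analysed in the paper may differ. In PyTorch, a comparable running average is available through &lt;code&gt;torch.optim.swa_utils.AveragedModel&lt;/code&gt;.&lt;/p&gt;

```python
class RunningAverage:
    """Uniform running average of a flat list of weights."""

    def __init__(self, weights):
        self.avg = list(weights)
        self.count = 1

    def update(self, weights):
        """Fold the latest weights into the average."""
        self.count += 1
        self.avg = [a + (w - a) / self.count for a, w in zip(self.avg, weights)]


averager = RunningAverage([0.0, 4.0])
for step_weights in ([2.0, 2.0], [4.0, 0.0]):  # weights after two more steps
    averager.update(step_weights)
print(averager.avg)  # [2.0, 2.0], the mean of the three snapshots
```

&lt;p&gt;Evaluation then uses the averaged weights while training continues on the raw ones, so the average can be read out at any step without committing to a horizon.&lt;/p&gt;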
&lt;h3 class="heading" id="schedulefree-at-llm-scale"&gt;
 ScheduleFree+ at LLM scale&lt;span class="heading__anchor"&gt; &lt;a href="#schedulefree-at-llm-scale"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio (2026)&lt;/strong&gt; extends schedule-free learning to full LLM pretraining in
&lt;em&gt;ScheduleFree+: Scaling Learning-Rate-Free and Schedule-Free Learning to Large
Language Models&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2605.19095"&gt;arXiv:2605.19095&lt;/a&gt;).
Practical fixes for large batch and model sizes enable ScheduleFree+ to achieve
a &lt;strong&gt;31% improvement&lt;/strong&gt; over WSD schedules at 1000 tokens per parameter, while
also providing a theoretical foundation for checkpoint merging during pretraining.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install schedulefree&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;
&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;schedulefree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AdamWScheduleFree&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AdamWScheduleFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
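&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# train mode must be set before the training steps (loop omitted)&lt;/span&gt;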
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Must switch to eval mode before evaluation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/facebookresearch/schedule_free"&gt;facebookresearch/schedule_free&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="3-hyperparameter-transfer-and-scaling-laws-µp"&gt;
 3. Hyperparameter Transfer and Scaling Laws (µP)&lt;span class="heading__anchor"&gt; &lt;a href="#3-hyperparameter-transfer-and-scaling-laws-%c2%b5p"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="weight-decay-as-the-true-driver-of-lr-transfer"&gt;
 Weight decay as the true driver of LR transfer&lt;span class="heading__anchor"&gt; &lt;a href="#weight-decay-as-the-true-driver-of-lr-transfer"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;The Maximal Update Parameterisation (µP) is widely used to transfer optimal
learning rates from proxy models to large ones without re-tuning. &lt;strong&gt;Kosson et al.
(2025/2026)&lt;/strong&gt;, accepted to ICLR 2026, provide a large-scale empirical refutation
of the standard µP narrative in &lt;em&gt;Weight Decay May Matter More than µP for
Learning Rate Transfer in Practice&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Their finding: µP&amp;rsquo;s geometric alignment assumptions, which require alignment
between a layer&amp;rsquo;s inputs, weights, and gradient updates, hold only &lt;strong&gt;briefly at
the start of training&lt;/strong&gt;. For the remainder, it is &lt;strong&gt;weight decay&lt;/strong&gt; that
stabilises update dynamics across widths and facilitates LR transfer. This
implies µP&amp;rsquo;s scaling primarily acts as an implicit warmup, and can be largely
replaced by modified warmup schedules.&lt;/p&gt;
&lt;h3 class="heading" id="embedding-layer-lr-as-the-key-factor"&gt;
 Embedding layer LR as the key factor&lt;span class="heading__anchor"&gt; &lt;a href="#embedding-layer-lr-as-the-key-factor"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kalra &amp;amp; Barkeshli (2026)&lt;/strong&gt; provide complementary evidence in &lt;em&gt;Quantifying
Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate&lt;/em&gt;,
tracing µP&amp;rsquo;s advantage over standard parameterisation (SP) to a single factor:
the &lt;strong&gt;embedding layer learning rate&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In SP, the embedding LR acts as a training bottleneck. Simply increasing it by a
factor of model width, matching µP, eliminates most of the gap. Three
quantitative metrics are used: quality of scaling law fit, robustness to
extrapolation errors, and asymptotic loss penalty.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simple fix that captures most of µP&amp;#39;s benefit in SP&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;embed_lr_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;base_width&lt;/span&gt; &lt;span class="c1"&gt;# = d_model / d_model_proxy&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
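&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;non_embed_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;embed.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="c1"&gt;# everything except the embedding&lt;/span&gt;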
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;lr&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;embed_lr_multiplier&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;non_embed_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;lr&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_lr&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;&lt;strong&gt;Open question:&lt;/strong&gt; Kosson et al. argue µP acts as an implicit warmup; Kalra &amp;amp;
Barkeshli argue it is about the embedding LR. Both contradict µP&amp;rsquo;s original
geometric motivation. No consensus has emerged, and the practical implications
differ significantly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="4-normalization-weight-decay-and-variance-reduction"&gt;
 4. Normalization, Weight Decay, and Variance Reduction&lt;span class="heading__anchor"&gt; &lt;a href="#4-normalization-weight-decay-and-variance-reduction"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="the-end-of-training-gradient-spike"&gt;
 The end-of-training gradient spike&lt;span class="heading__anchor"&gt; &lt;a href="#the-end-of-training-gradient-spike"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio (2025)&lt;/strong&gt; identifies a subtle pathology in &lt;em&gt;Why Gradients Rapidly
Increase Near the End of Training&lt;/em&gt;: gradient norms spike sharply near the end of
long LLM runs. The diagnosis is a three-way interaction between &lt;strong&gt;weight decay&lt;/strong&gt;,
&lt;strong&gt;normalisation layers&lt;/strong&gt;, and the &lt;strong&gt;LR schedule&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When a layer is followed by normalisation, its scale becomes irrelevant to the
forward pass, but weight decay continues shrinking the parameters. This creates
an implicit competition between the optimizer&amp;rsquo;s effective update size and
normalisation rescaling, causing gradient norms to grow unchecked as the LR
decays.&lt;/p&gt;
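&lt;p&gt;The mechanism can be made concrete with a short steady-state calculation. This is a sketch under simplifying assumptions that are not part of the summary above: plain SGD with weight decay $\lambda$ and learning rate $\gamma_t$, and a weight $w_t$ whose gradient $g_t$ is orthogonal to it, which holds when the following normalisation makes the loss invariant to the scale of $w_t$. AdamW follows this only approximately.&lt;/p&gt;
&lt;p&gt;$$ w_{t+1} = (1 - \gamma_t \lambda)\, w_t - \gamma_t g_t \quad\Longrightarrow\quad \|w_{t+1}\|^2 = (1 - \gamma_t \lambda)^2 \|w_t\|^2 + \gamma_t^2 \|g_t\|^2 $$&lt;/p&gt;
&lt;p&gt;Setting $\|w_{t+1}\| = \|w_t\|$ gives the equilibrium ratio:&lt;/p&gt;
&lt;p&gt;$$ \frac{\|g_t\|^2}{\|w_t\|^2} = \frac{2\lambda}{\gamma_t} - \lambda^2 \approx \frac{2\lambda}{\gamma_t} $$&lt;/p&gt;
&lt;p&gt;With $\lambda$ held fixed, a schedule that drives $\gamma_t$ towards zero therefore pushes $\|g_t\| / \|w_t\|$ up like $\sqrt{2\lambda / \gamma_t}$, which is consistent with the late-training rise in gradient norm.&lt;/p&gt;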
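&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; the paper&amp;rsquo;s correction (AdamC) scales the weight decay of
normalised layers in proportion to the scheduled LR, so the balance between
decay and update size stays fixed as the LR falls. A blunter alternative is to
switch weight decay off for selected AdamW parameter groups. The snippet below
shows only that grouping mechanism, using the common exemption of norm gains,
embeddings, and 1-D parameters:&lt;/p&gt;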

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;norm&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;weight_decay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;weight_decay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;The paper reports that its correction both eliminates the spike and lowers
loss throughout training. The same analysis helps explain why setups such as
modded-nanoGPT disable weight decay on their AdamW-updated layers.&lt;/p&gt;
&lt;h3 class="heading" id="weight-normalisation-as-an-alternative"&gt;
 Weight normalisation as an alternative&lt;span class="heading__anchor"&gt; &lt;a href="#weight-normalisation-as-an-alternative"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Nemotron-Flash&lt;/strong&gt; (Fu et al., 2025, NeurIPS 2025) investigates weight
normalisation as a practical mechanism in small language models, finding that it
enables more effective weight updates and improves final convergence. Because
the weight norm is controlled directly, the approach also sidesteps the
weight-decay/normalisation interaction described above.&lt;/p&gt;
&lt;h3 class="heading" id="mars-variance-reduction-meets-preconditioned-gradients"&gt;
 MARS: variance reduction meets preconditioned gradients&lt;span class="heading__anchor"&gt; &lt;a href="#mars-variance-reduction-meets-preconditioned-gradients"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite decades of theoretical work, variance reduction has largely failed to
yield practical gains in deep learning. &lt;strong&gt;Yuan et al. (2024/2025)&lt;/strong&gt; attempt to
change this in &lt;em&gt;MARS: Unleashing the Power of Variance Reduction for Training
Large Models&lt;/em&gt;, proposing a unified framework that reconciles AdamW, Lion, and
Shampoo with variance reduction via a &lt;strong&gt;scaled stochastic recursive momentum&lt;/strong&gt;
technique.&lt;/p&gt;
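&lt;p&gt;For orientation, the recursive-momentum correction can be sketched as follows. The notation is chosen here for illustration ($x_t$ for the parameters, $\xi_t$ for the minibatch, $\beta_1$ for the momentum coefficient, $\gamma_t$ for the scaling factor), and the sketch leaves out the clipping and preconditioning steps, for which the paper remains the reference:&lt;/p&gt;
&lt;p&gt;$$ c_t = \nabla f(x_t, \xi_t) + \gamma_t \frac{\beta_1}{1 - \beta_1} \left( \nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t) \right), \qquad m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, c_t $$&lt;/p&gt;
&lt;p&gt;With $\gamma_t = 0$ this reduces to ordinary exponential-moving-average momentum. The extra term evaluates the previous iterate on the same minibatch, so the gradient difference cancels part of the sampling noise; $m_t$ then takes the place of the raw momentum inside an AdamW-, Lion-, or Shampoo-style update.&lt;/p&gt;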
&lt;p&gt;The GPT-2 results reported in the paper are strong. However, the comprehensive
benchmark by &lt;strong&gt;Semenov et al. (2025)&lt;/strong&gt;, &lt;em&gt;Benchmarking Optimizers for Large
Language Model Pretraining&lt;/em&gt;, a 73-page study with 44 figures and 48 tables that
compares optimizers under standardised scenarios, reveals that &lt;strong&gt;MARS does not
work well with small batch sizes&lt;/strong&gt;, limiting its practical applicability in
memory-constrained settings.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This underscores the danger of evaluating optimizers on a single benchmark
setup: MARS looks excellent at the batch sizes used in the original paper and
brittle elsewhere.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="5-distributed-training-diloco-and-its-descendants"&gt;
 5. Distributed Training: DiLoCo and Its Descendants&lt;span class="heading__anchor"&gt; &lt;a href="#5-distributed-training-diloco-and-its-descendants"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;DiLoCo (Distributed Low-Communication training) runs AdamW as an &lt;em&gt;inner&lt;/em&gt;
optimizer for $H$ local steps on each worker (typically $H = 500$), then
synchronises by applying Nesterov momentum to the &lt;strong&gt;pseudo-gradient&lt;/strong&gt;: the net
change in each worker&amp;rsquo;s parameters over those inner steps, averaged across
workers. Workers therefore communicate once every $H$ steps instead of every
step, which is up to 500× less often.&lt;/p&gt;
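&lt;p&gt;The outer step described above can be sketched on plain Python lists. This is a toy illustration under stated assumptions, not DiLoCo&amp;rsquo;s actual implementation: the function names are hypothetical, the inner AdamW steps are replaced by hard-coded worker results, and the outer hyperparameters simply mirror common choices (learning rate 0.7, momentum 0.9).&lt;/p&gt;
&lt;figure class="code-block"&gt;
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Toy DiLoCo-style outer step on plain Python lists.
# All names here are illustrative, not taken from any DiLoCo codebase.


def pseudo_gradient(global_params, worker_params):
    """Average, over workers, of (global value minus local value after H inner steps)."""
    n_workers = len(worker_params)
    return [
        sum(g - local[i] for local in worker_params) / n_workers
        for i, g in enumerate(global_params)
    ]


def nesterov_outer_step(global_params, delta, velocity, lr=0.7, momentum=0.9):
    """Apply SGD with Nesterov momentum, treating delta as if it were a gradient."""
    new_velocity = [momentum * v + d for v, d in zip(velocity, delta)]
    new_params = [
        p - lr * (d + momentum * v)
        for p, d, v in zip(global_params, delta, new_velocity)
    ]
    return new_params, new_velocity


# Two workers that each started from global_params and ran their own inner steps.
global_params = [1.0, -2.0]
worker_params = [[0.8, -1.8], [0.6, -1.6]]
velocity = [0.0, 0.0]

delta = pseudo_gradient(global_params, worker_params)
global_params, velocity = nesterov_outer_step(global_params, delta, velocity)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;
&lt;p&gt;Only &lt;code&gt;delta&lt;/code&gt; has to cross the network, once per $H$ inner steps. The sign convention (global minus local) lets a standard descent-style optimizer consume the pseudo-gradient unchanged.&lt;/p&gt;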
&lt;h3 class="heading" id="opendiloco-the-open-source-foundation"&gt;
 OpenDiLoCo: the open-source foundation&lt;span class="heading__anchor"&gt; &lt;a href="#opendiloco-the-open-source-foundation"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;PrimeIntellect&amp;rsquo;s
&lt;a href="https://github.com/PrimeIntellect-ai/OpenDiloco"&gt;OpenDiLoCo&lt;/a&gt; provides a
reproducible drop-in implementation, demonstrated training across two continents
and three countries with 90–95% compute utilisation. It later served as the
foundation for INTELLECT-1, a 10B-parameter model trained globally.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;open_diloco.hivemind_diloco&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DiLoCoOptimizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;inner_optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;outer_optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DiLoCoOptimizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dht&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dht&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_inner_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# sync every 500 steps, 500× fewer communications&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inner_optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inner_optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outer_optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;outer_optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="why-diloco-works-on-a-single-node-snoo"&gt;
 Why DiLoCo works on a single node: SNOO&lt;span class="heading__anchor"&gt; &lt;a href="#why-diloco-works-on-a-single-node-snoo"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kallusky et al. (2025)&lt;/strong&gt; show in &lt;em&gt;SNOO: Step-K Nesterov Outer Optimizer&lt;/em&gt; that
DiLoCo&amp;rsquo;s effectiveness, even on a single node, stems from applying &lt;strong&gt;Nesterov
momentum to the pseudo-gradient&lt;/strong&gt;. Their method isolates this as a standalone
Lookahead variant. Results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1.5–2.5× FLOP efficiency&lt;/strong&gt; gains, measured at scales up to $10^{23}$ training FLOPs.&lt;/li&gt;
&lt;li&gt;Improvements &lt;em&gt;increase&lt;/em&gt; with model size.&lt;/li&gt;
&lt;li&gt;Compatible with both AdamW and Muon as inner optimizers.&lt;/li&gt;
&lt;li&gt;Minimal memory overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The single-worker DiLoCo achieves speedups of up to &lt;strong&gt;6.32%&lt;/strong&gt; in steps-to-loss
over AdamW on a 160M Llama model.&lt;/p&gt;
&lt;h3 class="heading" id="smoothing-diloco-generalized-primal-averaging-gpa"&gt;
 Smoothing DiLoCo: Generalized Primal Averaging (GPA)&lt;span class="heading__anchor"&gt; &lt;a href="#smoothing-diloco-generalized-primal-averaging-gpa"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio et al. (2025/2026)&lt;/strong&gt; propose &lt;strong&gt;GPA&lt;/strong&gt; in &lt;em&gt;Smoothing DiLoCo with Primal
Averaging for Faster Training of LLMs&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2512.17131"&gt;arXiv:2512.17131&lt;/a&gt;), which decouples
DiLoCo&amp;rsquo;s interpolation constants to enable smooth iterate averaging at every
step, replacing uniform averaging with exponential moving averaging.&lt;/p&gt;
&lt;p&gt;GPA unifies single-worker DiLoCo and ScheduleFree within a single non-distributed
framework. Speedups over AdamW in steps-to-target-loss:&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Model&lt;/th&gt;
					&lt;th style="text-align: right"&gt;Speedup vs. AdamW (steps to target loss)&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-160M&lt;/td&gt;
					&lt;td style="text-align: right"&gt;8.71%&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-1B&lt;/td&gt;
					&lt;td style="text-align: right"&gt;10.13%&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-8B&lt;/td&gt;
					&lt;td style="text-align: right"&gt;9.58%&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 class="heading" id="streaming-diloco-towards-free-distributed-training"&gt;
 Streaming DiLoCo: towards free distributed training&lt;span class="heading__anchor"&gt; &lt;a href="#streaming-diloco-towards-free-distributed-training"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Douillard et al. (2025)&lt;/strong&gt; address the remaining bottleneck in &lt;em&gt;Streaming
DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2501.18512"&gt;arXiv:2501.18512&lt;/a&gt;): even with infrequent
synchronisation, each sync exchanges all parameters simultaneously. Three fixes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Streaming sync:&lt;/strong&gt; synchronise only subsets of parameters at a time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Overlapping communication:&lt;/strong&gt; continue training during synchronisation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantisation:&lt;/strong&gt; reduce cross-worker data to fewer bits.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Together, these changes cut the required bandwidth by &lt;strong&gt;two orders of
magnitude&lt;/strong&gt; while maintaining comparable quality at billion-parameter scale.&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Method&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Setting&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Key contribution&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Gain&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;SNOO&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Single-node&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Nesterov momentum on pseudo-gradient&lt;/td&gt;
					&lt;td style="text-align: left"&gt;1.5–2.5× FLOP efficiency&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;GPA&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Single-node&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Smooth iterate averaging; unifies DiLoCo + SF&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~9% steps-to-loss&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Streaming DiLoCo&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Distributed&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Streaming sync + quantisation&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~100× bandwidth reduction&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="6-cross-cutting-themes-and-open-questions"&gt;
 6. Cross-Cutting Themes and Open Questions&lt;span class="heading__anchor"&gt; &lt;a href="#6-cross-cutting-themes-and-open-questions"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Several recurrent tensions emerge from reading these papers together.&lt;/p&gt;
&lt;h3 class="heading" id="geometry-vs-step-size-calibration-in-muon"&gt;
 Geometry vs. step-size calibration in Muon&lt;span class="heading__anchor"&gt; &lt;a href="#geometry-vs-step-size-calibration-in-muon"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Kovalev, Pethick et al., and Amsel et al. offer geometric explanations for
Muon&amp;rsquo;s success. Shumaylov et al. argue that geometry is practically irrelevant
and step-size optimality is the true driver. Which narrative guides future
research matters: geometry points toward more sophisticated matrix norms; the
step-size interpretation suggests much simpler paths to similar gains.&lt;/p&gt;
&lt;h3 class="heading" id="what-µp-is-actually-doing"&gt;
 What µP is actually doing&lt;span class="heading__anchor"&gt; &lt;a href="#what-%c2%b5p-is-actually-doing"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Kosson et al. argue that µP acts primarily as an implicit warmup mechanism. Kalra &amp;amp;
Barkeshli argue that its benefit comes essentially from the embedding-layer LR. Both stand in
contrast to µP&amp;rsquo;s original geometric motivation. The practical stakes are high:
under the warmup interpretation, µP could be replaced by a schedule change;
under the embedding-LR interpretation, by a one-line change to the embedding LR.&lt;/p&gt;
&lt;h3 class="heading" id="weight-decay-as-a-multi-role-hyperparameter"&gt;
 Weight decay as a multi-role hyperparameter&lt;span class="heading__anchor"&gt; &lt;a href="#weight-decay-as-a-multi-role-hyperparameter"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Weight decay appears as a protagonist in three independent stories in this
survey:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Defazio:&lt;/strong&gt; source of end-of-training gradient spikes via interaction with
normalisation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kosson et al.:&lt;/strong&gt; the true driver of LR transfer, not µP geometry.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kalra &amp;amp; Barkeshli:&lt;/strong&gt; improves scaling law fits but &lt;em&gt;hurts&lt;/em&gt; extrapolation
robustness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is no longer tenable to treat weight decay as a simple regulariser with a
sensible default. It must be understood per layer and in interaction with the
normalisation strategy.&lt;/p&gt;
&lt;h3 class="heading" id="diloco-as-the-practical-distributed-optimizer"&gt;
 DiLoCo as the practical distributed optimizer&lt;span class="heading__anchor"&gt; &lt;a href="#diloco-as-the-practical-distributed-optimizer"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite a large body of research on distributed optimizers, DiLoCo and its
derivatives appear to be the only methods that consistently add value beyond
simply scaling the batch size. The finding that its benefits carry over to
single-node settings (via SNOO and GPA) makes it a particularly important line
of work for practitioners at all scales.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="practical-recommendations-for-2026"&gt;
 Practical Recommendations for 2026&lt;span class="heading__anchor"&gt; &lt;a href="#practical-recommendations-for-2026"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Taken together, the evidence across these papers suggests the following
starting points for a new large training run:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Optimizer:&lt;/strong&gt; Muon for hidden-layer matrix weights + AdamW for
embeddings/head. The Moonlight scaling fixes (weight decay + update scale
adjustment) are necessary above ~1B parameters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule:&lt;/strong&gt; ScheduleFree+ or linear decay instead of cosine. If you need a
fixed-horizon schedule, use WSD with a higher $\beta_2$ during cooldown.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weight decay:&lt;/strong&gt; Disable it for layers directly followed by normalisation to
avoid end-of-training gradient spikes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Outer optimizer:&lt;/strong&gt; Wrap your training loop with single-worker DiLoCo (SNOO
or GPA) for a ~9% efficiency gain with no architectural changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;µP alternatives:&lt;/strong&gt; Before taking on the full overhead of µP, try increasing the
embedding-layer LR by a factor of $d_{\text{model}} / d_{\text{proxy}}$.
This may reproduce most of the benefit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;None of these require fundamental architectural changes.&lt;/p&gt;
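&lt;p&gt;As a worked illustration of recommendation 5, with purely hypothetical numbers (none of them come from the cited papers): suppose the embedding LR was tuned to $\eta_{\text{emb}} = 10^{-3}$ on a proxy of width $d_{\text{proxy}} = 512$, and the target model has width $d_{\text{model}} = 4096$. The symbol $\eta_{\text{emb}}$ is notation introduced here for the embedding LR. The suggested scaling then gives&lt;/p&gt;
&lt;p&gt;$$\eta_{\text{emb}}^{\text{target}} = \eta_{\text{emb}} \cdot \frac{d_{\text{model}}}{d_{\text{proxy}}} = 10^{-3} \cdot \frac{4096}{512} = 8 \times 10^{-3}$$&lt;/p&gt;
&lt;p&gt;In this sketch the remaining learning rates are assumed to stay at their proxy-tuned values. The result is best treated as a starting point for a small sweep rather than a guaranteed optimum.&lt;/p&gt;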
&lt;hr&gt;
&lt;h2 class="heading" id="references"&gt;
 References&lt;span class="heading__anchor"&gt; &lt;a href="#references"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th&gt;#&lt;/th&gt;
					&lt;th&gt;Paper&lt;/th&gt;
					&lt;th&gt;Venue&lt;/th&gt;
					&lt;th&gt;Links&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td&gt;1&lt;/td&gt;
					&lt;td&gt;Jordan et al. (2024): &lt;em&gt;Muon: An optimizer for hidden layers in neural networks&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://kellerjordan.github.io/posts/muon/"&gt;blog&lt;/a&gt; · &lt;a href="https://github.com/KellerJordan/Muon"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;2&lt;/td&gt;
					&lt;td&gt;Liu et al. (2025): &lt;em&gt;Muon is Scalable for LLM Training&lt;/em&gt; (Moonlight)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2502.16982"&gt;arXiv:2502.16982&lt;/a&gt; · &lt;a href="https://github.com/MoonshotAI/Moonlight"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;3&lt;/td&gt;
					&lt;td&gt;Kovalev (2025): &lt;em&gt;Understanding Gradient Orthogonalization&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;4&lt;/td&gt;
					&lt;td&gt;Pethick et al. (2025): &lt;em&gt;Training Deep Learning Models with Norm-Constrained LMOs&lt;/em&gt; (Scion)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2502.07529"&gt;arXiv:2502.07529&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;5&lt;/td&gt;
					&lt;td&gt;Amsel et al. (2025): &lt;em&gt;The Polar Express&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;6&lt;/td&gt;
					&lt;td&gt;Shumaylov et al. (2026): &lt;em&gt;Muon is Not That Special&lt;/em&gt; (Freon/Kaon)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;7&lt;/td&gt;
					&lt;td&gt;Defazio et al. (2023): &lt;em&gt;Optimal Linear Decay Learning Rate Schedules&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2310.07831"&gt;arXiv:2310.07831&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;8&lt;/td&gt;
					&lt;td&gt;Dremov et al. (2025): &lt;em&gt;Training Dynamics of the Cooldown Stage in WSD&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;9&lt;/td&gt;
					&lt;td&gt;Schaipp et al. (2025): &lt;em&gt;Surprising Agreement Between Convex Theory and LR Scheduling&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;10&lt;/td&gt;
					&lt;td&gt;Meterez et al. (2026): &lt;em&gt;Anytime Pretraining&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2602.03702"&gt;arXiv:2602.03702&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;11&lt;/td&gt;
					&lt;td&gt;Defazio (2026): &lt;em&gt;ScheduleFree+&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2605.19095"&gt;arXiv:2605.19095&lt;/a&gt; · &lt;a href="https://github.com/facebookresearch/schedule_free"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;12&lt;/td&gt;
					&lt;td&gt;Kosson et al. (2026): &lt;em&gt;Weight Decay May Matter More than µP&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;ICLR 2026&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;13&lt;/td&gt;
					&lt;td&gt;Kalra &amp;amp; Barkeshli (2026): &lt;em&gt;Quantifying HP Transfer and Embedding LR&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;14&lt;/td&gt;
					&lt;td&gt;Defazio (2025): &lt;em&gt;Why Gradients Rapidly Increase Near End of Training&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;15&lt;/td&gt;
					&lt;td&gt;Fu et al. (2025): &lt;em&gt;Nemotron-Flash&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;NeurIPS 2025&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;16&lt;/td&gt;
					&lt;td&gt;Yuan et al. (2025): &lt;em&gt;MARS&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;17&lt;/td&gt;
					&lt;td&gt;Semenov et al. (2025): &lt;em&gt;Benchmarking Optimizers for LLM Pretraining&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;18&lt;/td&gt;
					&lt;td&gt;Kallusky et al. (2025): &lt;em&gt;SNOO&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;19&lt;/td&gt;
					&lt;td&gt;Defazio et al. (2026): &lt;em&gt;Smoothing DiLoCo with Primal Averaging (GPA)&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2512.17131"&gt;arXiv:2512.17131&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;20&lt;/td&gt;
					&lt;td&gt;Douillard et al. (2025): &lt;em&gt;Streaming DiLoCo&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2501.18512"&gt;arXiv:2501.18512&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;21&lt;/td&gt;
					&lt;td&gt;Douillard et al. (2023/2024): &lt;em&gt;DiLoCo&lt;/em&gt; (original)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2311.08105"&gt;arXiv:2311.08105&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;22&lt;/td&gt;
					&lt;td&gt;Prime Intellect (2024): &lt;em&gt;OpenDiLoCo&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://github.com/PrimeIntellect-ai/OpenDiloco"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.primeintellect.ai/blog/opendiloco"&gt;blog&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;</description></item></channel></rss>