<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Optimization on Nam Le</title><link>https://blog.namln.org/en/tags/optimization/</link><description>Recent content in Optimization on Nam Le</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Thu, 28 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.namln.org/en/tags/optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>Recent Advances in Neural Network Optimization for LLM Training</title><link>https://blog.namln.org/en/posts/llm-optimization-2025-survey/</link><pubDate>Thu, 28 May 2026 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/posts/llm-optimization-2025-survey/</guid><description>&lt;p&gt;The optimization landscape for LLM training looks very different from two years
ago. AdamW still dominates production runs, but a wave of research is eroding
that dominance from multiple angles simultaneously: matrix-aware optimizers,
horizon-free schedulers, a sharply revised understanding of µP, and
communication-efficient distributed methods. This post synthesizes 18 recent
papers across five interconnected fronts.&lt;/p&gt;
&lt;p&gt;The unifying thread is an active re-examination of long-held assumptions, from
whether gradient geometry matters, to what µP is actually doing, to whether
weight decay is a regularizer at all.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="1-muon-and-non-euclidean-optimizers"&gt;
 1. Muon and Non-Euclidean Optimizers&lt;span class="heading__anchor"&gt; &lt;a href="#1-muon-and-non-euclidean-optimizers"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="background"&gt;
 Background&lt;span class="heading__anchor"&gt; &lt;a href="#background"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Muon&lt;/strong&gt; (&lt;em&gt;&lt;strong&gt;M&lt;/strong&gt;oment&lt;strong&gt;U&lt;/strong&gt;m &lt;strong&gt;O&lt;/strong&gt;rthogonalized by &lt;strong&gt;N&lt;/strong&gt;ewton-Schulz&lt;/em&gt;) orthogonalizes the
momentum-averaged gradient with a Newton-Schulz iteration before each weight
update. Rather than treating each parameter as an independent scalar (as Adam
does), Muon recognizes that weight matrices have geometric structure and
optimizes them accordingly, performing steepest descent under the &lt;strong&gt;spectral
norm&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The core Newton-Schulz iteration, which runs stably in &lt;code&gt;bfloat16&lt;/code&gt; on tensor
cores, is:&lt;/p&gt;
&lt;p&gt;$$
X \leftarrow aX + b(XX^\top)X + c(XX^\top)^2 X
$$&lt;/p&gt;
&lt;p&gt;with coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. In PyTorch:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;newtonschulz5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.4445&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;4.7750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0315&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Frobenius norm bounds the spectral norm, so singular values are at most 1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# iterate on the wide orientation so X @ X.T is the smaller Gram matrix&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;A ready-to-use implementation lives at
&lt;a href="https://github.com/KellerJordan/Muon"&gt;KellerJordan/Muon&lt;/a&gt;. Install via:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install git+https://github.com/KellerJordan/Muon&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;Muon is intended for hidden-layer matrix weights only. Embeddings, the output
head, and scalar/vector parameters should still use AdamW:&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;muon&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MuonWithAuxAdam&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Muon gets the hidden-layer matrices; the auxiliary Adam gets everything else&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hidden_matrix_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;embed_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;scalar_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;head_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_head&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Param-group interface as in the KellerJordan/Muon README; check your installed version&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MuonWithAuxAdam&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hidden_matrix_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_muon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_params&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;scalar_params&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;head_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;use_muon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Per the Muon README, the Muon LR has built-in muP scaling and should need little retuning as you scale up&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="scaling-muon-the-moonlight-result"&gt;
 Scaling Muon: the Moonlight result&lt;span class="heading__anchor"&gt; &lt;a href="#scaling-muon-the-moonlight-result"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;MoonshotAI&amp;rsquo;s &lt;strong&gt;Moonlight&lt;/strong&gt; (a Mixture-of-Experts model with 3B activated and 16B total parameters, trained on 5.7T tokens)
provides the strongest evidence yet that Muon scales to real LLM training
(&lt;a href="https://arxiv.org/abs/2502.16982"&gt;arXiv:2502.16982&lt;/a&gt;,
&lt;a href="https://github.com/MoonshotAI/Moonlight"&gt;GitHub&lt;/a&gt;). Two fixes are needed to
make Muon work beyond small scale:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Weight decay:&lt;/strong&gt; without it, weight and output RMS norms grow until they
overflow &lt;code&gt;bfloat16&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-parameter update scale adjustment:&lt;/strong&gt; matching the RMS update norm of
AdamW by a factor of $\sqrt{(1-\beta_1)/(1+\beta_1)}$.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With these in place, scaling-law experiments indicate roughly &lt;strong&gt;2× computational
efficiency&lt;/strong&gt; compared to AdamW at compute-optimal settings.&lt;/p&gt;
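&lt;p&gt;As a quick numerical check of the scale factor in the second fix, the snippet
below evaluates $\sqrt{(1-\beta_1)/(1+\beta_1)}$ for two common momentum
settings. The helper name &lt;code&gt;adamw_rms_factor&lt;/code&gt; and the $\beta_1$ values are
illustrative assumptions, not taken from the Moonlight code.&lt;/p&gt;

```python
import math


def adamw_rms_factor(beta1):
    """Evaluate sqrt((1 - beta1) / (1 + beta1)), the factor quoted above."""
    return math.sqrt((1.0 - beta1) / (1.0 + beta1))


# beta1 values are illustrative assumptions, not settings from the paper.
for beta1 in (0.9, 0.95):
    print(f"beta1={beta1}: factor={adamw_rms_factor(beta1):.3f}")
```

&lt;p&gt;For $\beta_1 = 0.9$ the factor is about 0.23, so under this formula the
matched update RMS is a fraction of the raw orthogonalized update.&lt;/p&gt;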

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Train a Qwen-like dense model with Muon (from Moonlight repo)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python3 examples/toy_train.py &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --model qwen --optimizer muon &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --dataset openwebtext-100k &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --hidden_size &lt;span class="m"&gt;896&lt;/span&gt; --lr 1e-3&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;A further efficiency variant is
&lt;a href="https://github.com/nil0x9/flash-muon"&gt;Flash-Muon&lt;/a&gt;, which reimplements the
Newton-Schulz inner loop with a custom Triton kernel. Because $XX^\top$ is
symmetric, only about half of its entries need to be computed, which roughly halves the FLOPs of that product.&lt;/p&gt;
&lt;h3 class="heading" id="theoretical-foundations"&gt;
 Theoretical foundations&lt;span class="heading__anchor"&gt; &lt;a href="#theoretical-foundations"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kovalev (2025)&lt;/strong&gt; shows in &lt;em&gt;Understanding Gradient Orthogonalization via
Non-Euclidean Trust-Region Optimization&lt;/em&gt; that the orthogonalized gradient update
can be interpreted as a first-order trust-region method where the trust-region is
defined in terms of the matrix spectral norm. This framework unifies Muon with
normalized SGD and signSGD with momentum.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pethick et al. (2025)&lt;/strong&gt; propose &lt;strong&gt;Scion&lt;/strong&gt;, a family of LMO-based algorithms
that subsumes Muon, AdamW, and normalized SGD under a single framework
(&lt;a href="https://arxiv.org/abs/2502.07529"&gt;arXiv:2502.07529&lt;/a&gt;). By choosing an explicit
norm for deep architectures, Scion also achieves hyperparameter transferability
across model widths.&lt;/p&gt;
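&lt;p&gt;To make the LMO view concrete, the sketch below writes out the linear
minimization oracle over a unit norm ball for two vector norms: the
$\ell_\infty$ ball gives a sign update and the $\ell_2$ ball gives a
normalized-gradient update. The function names are hypothetical and this is
plain Python, not Scion&amp;rsquo;s implementation. For the spectral norm the analogous
oracle returns $-UV^\top$ from the SVD $G = U\Sigma V^\top$, which is the
quantity Newton-Schulz approximates.&lt;/p&gt;

```python
import math


def lmo_linf(grad):
    """A minimizer of dot(grad, d) over max|d_i| at most 1: the negative sign vector."""
    return [0.0 if g == 0 else -math.copysign(1.0, g) for g in grad]


def lmo_l2(grad, eps=1e-12):
    """The minimizer of dot(grad, d) over ||d||_2 at most 1: the negative unit gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [-g / (norm + eps) for g in grad]


grad = [3.0, -4.0]  # toy gradient, chosen only for illustration
print(lmo_linf(grad))  # sign-descent direction
print(lmo_l2(grad))    # normalized-gradient direction
```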
&lt;p&gt;&lt;strong&gt;The Polar Express&lt;/strong&gt; (Amsel et al., 2025) replaces the fixed Newton-Schulz
polynomial with updates chosen at each iteration by solving a minimax problem
that minimizes the worst-case error of the polar-factor approximation. It
converges faster than Newton-Schulz in both the early and asymptotic stages,
while remaining numerically stable in &lt;code&gt;bfloat16&lt;/code&gt;.&lt;/p&gt;
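&lt;p&gt;One way to see what convergence means for these iterations: because each step
is an odd polynomial in $X$, it acts on every singular value $s$ of $X$
independently through $p(s) = as + bs^3 + cs^5$. The scalar sketch below applies
that map with the Newton-Schulz coefficients quoted in the Background section;
the function name and the starting values are illustrative assumptions.&lt;/p&gt;

```python
A, B, C = 3.4445, -4.7750, 2.0315  # coefficients quoted in the Background section


def newton_schulz_scalar(s, steps=5):
    """Apply p(s) = A*s + B*s**3 + C*s**5 repeatedly to one singular value."""
    for _ in range(steps):
        s = A * s + B * s**3 + C * s**5
    return s


# Starting values are illustrative; after Frobenius normalization they lie in (0, 1].
for s0 in (0.01, 0.1, 0.5, 1.0):
    print(f"s0={s0}: after 5 steps s={newton_schulz_scalar(s0):.3f}")
```

&lt;p&gt;In this sketch the values land in a band around 1 rather than exactly at 1,
which is the kind of approximation error a faster-converging scheme aims to
shrink.&lt;/p&gt;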
&lt;h3 class="heading" id="challenging-the-geometric-narrative"&gt;
 Challenging the geometric narrative&lt;span class="heading__anchor"&gt; &lt;a href="#challenging-the-geometric-narrative"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite the theoretical appeal, &lt;strong&gt;Shumaylov et al. (2026)&lt;/strong&gt; mount a systematic
challenge in &lt;em&gt;Muon is Not That Special: Random or Inverted Spectra Work Just as
Well&lt;/em&gt;. They introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Freon:&lt;/strong&gt; a family of optimizers based on Schatten (quasi-)norms,
interpolating between SGD and Muon. The best-performing Schatten parameter for
GPT-2 lies in the &lt;em&gt;quasi-norm&lt;/em&gt; regime, which no LMO-based optimizer can
represent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kaon:&lt;/strong&gt; replaces Muon&amp;rsquo;s singular values with random noise, yet still
matches Muon&amp;rsquo;s validation loss on GPT-2.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Their key insight: performance is primarily controlled by two local quantities,
&lt;em&gt;alignment&lt;/em&gt; (how well the update direction aligns with the gradient) and &lt;em&gt;descent
potential&lt;/em&gt; (step-size optimality). Muon succeeds by guaranteeing step-size
optimality, not by tracking an ideal geometry.&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Optimizer&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Core mechanism&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Key claim&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Muon&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Newton-Schulz orthogonalization&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~2× efficiency over AdamW at compute-optimal settings&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Scion&lt;/td&gt;
					&lt;td style="text-align: left"&gt;LMO over norm-ball&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Unifies Muon/Adam; HP transferable across widths&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Polar Express&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Minimax polar decomposition&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Faster convergence; bfloat16-safe&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Freon / Kaon&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Schatten quasi-norms / random SVs&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Spectral geometry is not essential; alignment and descent potential drive performance&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="2-learning-rate-scheduling"&gt;
 2. Learning Rate Scheduling&lt;span class="heading__anchor"&gt; &lt;a href="#2-learning-rate-scheduling"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="linear-decay-is-provably-optimal"&gt;
 Linear decay is provably optimal&lt;span class="heading__anchor"&gt; &lt;a href="#linear-decay-is-provably-optimal"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio et al. (2023/2024)&lt;/strong&gt; close a long-standing gap between theory and
practice in &lt;em&gt;Optimal Linear Decay Learning Rate Schedules and Further
Refinements&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2310.07831"&gt;arXiv:2310.07831&lt;/a&gt;). Under
worst-case analysis, &lt;strong&gt;linear decay&lt;/strong&gt;, setting $\eta_t \propto (1 - t/T)$, is
the theoretically optimal schedule for a broad class of optimizers including SGD.
Across 10 diverse deep learning problems, it matches or outperforms commonly used schedules, including cosine annealing.&lt;/p&gt;
&lt;p&gt;$$
\eta_t = \eta_{\max} \cdot \left(1 - \frac{t}{T}\right)
$$&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# PyTorch built-in, the optimal default&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LinearLR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_steps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="the-wsd-cooldown-phase"&gt;
 The WSD cooldown phase&lt;span class="heading__anchor"&gt; &lt;a href="#the-wsd-cooldown-phase"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;The Warmup-Stable-Decay (WSD) scheduler separates training into distinct phases
ending in a sharp LR drop. &lt;strong&gt;Dremov et al. (2025)&lt;/strong&gt; analyse the cooldown phase
specifically in &lt;em&gt;Training Dynamics of the Cooldown Stage in WSD&lt;/em&gt;, finding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cooldown shapes that balance exploration and exploitation consistently
outperform purely exploratory or exploitative alternatives.&lt;/li&gt;
&lt;li&gt;There is substantial sensitivity to AdamW&amp;rsquo;s $\beta_2$ parameter during
cooldown, and &lt;strong&gt;higher $\beta_2$ values yield consistent improvements&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Loss-landscape visualisations support the &amp;ldquo;river valley&amp;rdquo; perspective: the
cooldown follows a narrow valley in parameter space.&lt;/li&gt;
&lt;/ul&gt;
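&lt;p&gt;For reference, a WSD schedule can be written as a small step-to-LR function.
The sketch below assumes a linear warmup and a linear cooldown to zero. The
cooldown shape is exactly what the paper varies, so the linear choice, the
function name &lt;code&gt;wsd_lr&lt;/code&gt;, and all argument names and values are illustrative
assumptions.&lt;/p&gt;

```python
def wsd_lr(step, total_steps, warmup_steps, cooldown_steps, lr_max):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear cooldown to zero."""
    cooldown_start = total_steps - cooldown_steps
    if step >= cooldown_start:
        # Cooldown phase: LR falls linearly and reaches zero at total_steps.
        remaining = max(total_steps - step, 0)
        return lr_max * remaining / cooldown_steps
    if step >= warmup_steps:
        # Stable phase: hold the peak LR.
        return lr_max
    # Warmup phase: ramp linearly up to lr_max.
    return lr_max * (step + 1) / warmup_steps


# All numbers below are illustrative, not settings from the paper.
for step in (0, 99, 500, 800, 900, 1000):
    print(step, wsd_lr(step, total_steps=1000, warmup_steps=100,
                       cooldown_steps=200, lr_max=3e-4))
```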
&lt;h3 class="heading" id="convex-theory-meets-llm-practice"&gt;
 Convex theory meets LLM practice&lt;span class="heading__anchor"&gt; &lt;a href="#convex-theory-meets-llm-practice"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Schaipp et al. (2025)&lt;/strong&gt; show in &lt;em&gt;The Surprising Agreement Between Convex
Optimization Theory and Learning-Rate Scheduling for Large Model Training&lt;/em&gt; that
schedules for large model training obey performance bounds from non-smooth convex
optimisation. For the constant schedule with linear cooldown, the bound is:&lt;/p&gt;
&lt;p&gt;$$
\bar{f}_T - f^* \leq \frac{\|x_0 - x^*\|^2}{2\eta T} + \frac{\eta}{2} \sum_{t=0}^{T-1} \sigma_t^2
$$&lt;/p&gt;
&lt;p&gt;where the cooldown benefit appears explicitly through the absence of logarithmic
terms. This enables &lt;strong&gt;principled LR transfer&lt;/strong&gt;: exploiting the theory yields
noticeable validation loss improvements for 124M and 210M Llama-type models when
extending schedules for continued training.&lt;/p&gt;
&lt;h3 class="heading" id="anytime-schedules-and-weight-averaging"&gt;
 Anytime schedules and weight averaging&lt;span class="heading__anchor"&gt; &lt;a href="#anytime-schedules-and-weight-averaging"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Meterez et al. (2026)&lt;/strong&gt; prove in &lt;em&gt;Anytime Pretraining: Horizon-Free
Learning-Rate Schedules with Weight Averaging&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2602.03702"&gt;arXiv:2602.03702&lt;/a&gt;) that horizon-free (anytime)
schedules exist for overparameterised linear regression, with &lt;strong&gt;weight averaging&lt;/strong&gt;
central to achieving minimax-optimal convergence. At 150M–300M parameters, trained at
1–32× Chinchilla scale, a constant LR with weight averaging matches well-tuned
cosine decay across the full training duration.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Weight averaging is a largely underutilised practical lever. It should be a
default, not an afterthought.&lt;/p&gt;
&lt;/blockquote&gt;
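&lt;p&gt;A minimal sketch of what weight averaging means in practice: keep a running
uniform average of the iterates next to the live weights and evaluate with the
average. The helper name and the flat-list representation of the weights are
illustrative assumptions, and the paper&amp;rsquo;s exact averaging scheme may differ.&lt;/p&gt;

```python
def update_running_average(avg, weights, num_averaged):
    """Fold one more weight vector into a uniform running average."""
    return [a + (w - a) / (num_averaged + 1) for a, w in zip(avg, weights)]


# Toy iterates standing in for model weights after successive steps.
iterates = [[1.0, 1.0], [3.0, 5.0], [5.0, 0.0]]
avg = list(iterates[0])
for count, weights in enumerate(iterates[1:], start=1):
    avg = update_running_average(avg, weights, count)
print(avg)  # element-wise mean of the three iterates
```

&lt;p&gt;The incremental form avoids storing past checkpoints: only the current
average and a counter are kept.&lt;/p&gt;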
&lt;h3 class="heading" id="schedulefree-at-llm-scale"&gt;
 ScheduleFree+ at LLM scale&lt;span class="heading__anchor"&gt; &lt;a href="#schedulefree-at-llm-scale"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio (2026)&lt;/strong&gt; extends schedule-free learning to full LLM pretraining in
&lt;em&gt;ScheduleFree+: Scaling Learning-Rate-Free and Schedule-Free Learning to Large
Language Models&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2605.19095"&gt;arXiv:2605.19095&lt;/a&gt;).
Practical fixes for large batch and model sizes enable ScheduleFree+ to achieve
a &lt;strong&gt;31% improvement&lt;/strong&gt; over WSD schedules at 1000 tokens per parameter, while
also providing a theoretical foundation for checkpoint merging during pretraining.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install schedulefree&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;
&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;schedulefree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AdamWScheduleFree&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AdamWScheduleFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Training steps run in optimizer.train() mode; switch to optimizer.eval() before evaluating or checkpointing&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/facebookresearch/schedule_free"&gt;facebookresearch/schedule_free&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="3-hyperparameter-transfer-and-scaling-laws-µp"&gt;
 3. Hyperparameter Transfer and Scaling Laws (µP)&lt;span class="heading__anchor"&gt; &lt;a href="#3-hyperparameter-transfer-and-scaling-laws-%c2%b5p"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="weight-decay-as-the-true-driver-of-lr-transfer"&gt;
 Weight decay as the true driver of LR transfer&lt;span class="heading__anchor"&gt; &lt;a href="#weight-decay-as-the-true-driver-of-lr-transfer"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;The Maximal Update Parameterisation (µP) is widely used to transfer optimal
learning rates from proxy models to large ones without re-tuning. &lt;strong&gt;Kosson et al.
(2025/2026)&lt;/strong&gt;, accepted to ICLR 2026, provide a large-scale empirical refutation
of the standard µP narrative in &lt;em&gt;Weight Decay May Matter More than µP for
Learning Rate Transfer in Practice&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Their finding: µP&amp;rsquo;s scaling rules assume that a layer&amp;rsquo;s inputs, weights,
and gradient updates stay aligned, and that assumption holds only &lt;strong&gt;briefly at
the start of training&lt;/strong&gt;. For the remainder, it is &lt;strong&gt;weight decay&lt;/strong&gt; that
stabilises update dynamics across widths and enables LR transfer. The implication
is that µP&amp;rsquo;s scaling acts primarily as an implicit warmup and can largely be
replaced by a modified warmup schedule.&lt;/p&gt;
&lt;h3 class="heading" id="embedding-layer-lr-as-the-key-factor"&gt;
 Embedding layer LR as the key factor&lt;span class="heading__anchor"&gt; &lt;a href="#embedding-layer-lr-as-the-key-factor"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kalra &amp;amp; Barkeshli (2026)&lt;/strong&gt; provide complementary evidence in &lt;em&gt;Quantifying
Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate&lt;/em&gt;,
tracing µP&amp;rsquo;s advantage over standard parameterisation (SP) to a single factor:
the &lt;strong&gt;embedding layer learning rate&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In SP, the embedding LR acts as a training bottleneck. Simply increasing it by a
factor of model width, matching µP, eliminates most of the gap. The comparison
rests on three quantitative metrics: quality of the scaling-law fit, robustness
to extrapolation errors, and asymptotic loss penalty.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
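&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simple fix that captures most of µP&amp;#39;s benefit in SP (model, model_width, base_width, base_lr defined elsewhere)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;embed_lr_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;base_width&lt;/span&gt; &lt;span class="c1"&gt;# = d_model / d_model_proxy&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;non_embed_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;embed.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;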
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;lr&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;embed_lr_multiplier&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;non_embed_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;lr&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;base_lr&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_groups&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;p&gt;&lt;strong&gt;Open question:&lt;/strong&gt; Kosson et al. argue µP acts as an implicit warmup; Kalra &amp;amp;
Barkeshli argue it is about the embedding LR. Both contradict µP&amp;rsquo;s original
geometric motivation. No consensus has emerged, and the practical implications
differ significantly.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="4-normalization-weight-decay-and-variance-reduction"&gt;
 4. Normalization, Weight Decay, and Variance Reduction&lt;span class="heading__anchor"&gt; &lt;a href="#4-normalization-weight-decay-and-variance-reduction"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="the-end-of-training-gradient-spike"&gt;
 The end-of-training gradient spike&lt;span class="heading__anchor"&gt; &lt;a href="#the-end-of-training-gradient-spike"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio (2025)&lt;/strong&gt; identifies a subtle pathology in &lt;em&gt;Why Gradients Rapidly
Increase Near the End of Training&lt;/em&gt;: gradient norms spike sharply near the end of
long LLM runs. The diagnosis is a three-way interaction between &lt;strong&gt;weight decay&lt;/strong&gt;,
&lt;strong&gt;normalisation layers&lt;/strong&gt;, and the &lt;strong&gt;LR schedule&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When a layer is followed by normalisation, its scale becomes irrelevant to the
forward pass, but weight decay continues shrinking the parameters. Because the
gradient of a scale-invariant layer is inversely proportional to its weight
norm, the weight norm settles where the optimizer&amp;rsquo;s updates and the decay
balance, and that balance point shifts as the LR decays, so gradient norms rise
sharply near the end of the schedule.&lt;/p&gt;
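&lt;p&gt;The scale dependence can be checked numerically. The sketch below uses a toy loss that sees only the direction of the weight vector, as a stand-in for a layer followed by normalisation. The loss, the target, and the helper names are illustrative assumptions, not taken from the paper:&lt;/p&gt;

```python
import math


def normalised_loss(w, target):
    """Loss that only sees the direction of w, as after a normalisation layer."""
    norm = math.sqrt(sum(x * x for x in w))
    return sum((x / norm - t) ** 2 for x, t in zip(w, target))


def grad_norm(w, target, eps=1e-6):
    """Norm of the central finite-difference gradient of normalised_loss."""
    grads = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        diff = normalised_loss(up, target) - normalised_loss(down, target)
        grads.append(diff / (2 * eps))
    return math.sqrt(sum(g * g for g in grads))


target = [0.0, 1.0]               # hypothetical target direction
w = [3.0, 4.0]
w_shrunk = [0.5 * x for x in w]   # same direction, half the norm

ratio = grad_norm(w_shrunk, target) / grad_norm(w, target)
print(round(ratio, 3))  # approximately 2.0
```

&lt;p&gt;Halving the weight norm leaves the loss unchanged but doubles the gradient norm, so anything that shrinks the weights of a normalised layer inflates its gradients.&lt;/p&gt;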
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; disable weight decay for AdamW-updated layers in architectures where
those layers are directly followed by normalisation (e.g. every transformer
block):&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;norm&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;embed&amp;#34;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndim&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;weight_decay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;params&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;no_wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;weight_decay&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
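&lt;/figure&gt;&lt;p&gt;This simultaneously eliminates the spike and reduces loss throughout training.
The analysis explains why weight decay should be disabled for AdamW-updated
layers in architectures like modded-nanoGPT. Note that the name-based split
above is only the conventional no-decay grouping (norm gains, embeddings, and 1-D
parameters); applying the fix as stated also means moving the weight matrices
that feed directly into a normalisation layer into the zero-decay group.&lt;/p&gt;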
&lt;h3 class="heading" id="weight-normalisation-as-an-alternative"&gt;
 Weight normalisation as an alternative&lt;span class="heading__anchor"&gt; &lt;a href="#weight-normalisation-as-an-alternative"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Nemotron-Flash&lt;/strong&gt; (Fu et al., 2025, NeurIPS 2025) investigates weight
normalisation as a practical mechanism in small language models, finding that it
enables more effective weight updates and improves final convergence. Weight
normalisation sidesteps the weight-decay/normalisation interaction described
above, though at the cost of slightly worse final loss compared to a well-tuned
baseline.&lt;/p&gt;
&lt;h3 class="heading" id="mars-variance-reduction-meets-preconditioned-gradients"&gt;
 MARS: variance reduction meets preconditioned gradients&lt;span class="heading__anchor"&gt; &lt;a href="#mars-variance-reduction-meets-preconditioned-gradients"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite decades of theoretical work, variance reduction has largely failed to
yield practical gains in deep learning. &lt;strong&gt;Yuan et al. (2024/2025)&lt;/strong&gt; attempt to
change this in &lt;em&gt;MARS: Unleashing the Power of Variance Reduction for Training
Large Models&lt;/em&gt;, proposing a unified framework that reconciles AdamW, Lion, and
Shampoo with variance reduction via a &lt;strong&gt;scaled stochastic recursive momentum&lt;/strong&gt;
technique.&lt;/p&gt;
&lt;p&gt;The GPT-2 training results reported in the paper look strong. However, the
comprehensive benchmark by &lt;strong&gt;Semenov et al. (2025)&lt;/strong&gt;, &lt;em&gt;Benchmarking Optimizers
for Large Language Model Pretraining&lt;/em&gt;, a 73-page study with 44 figures and 48
tables across standardised scenarios, reveals that &lt;strong&gt;MARS does not work well with
small batch sizes&lt;/strong&gt;, limiting its practical applicability in memory-constrained settings.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This underscores the danger of evaluating optimizers on a single benchmark
setup: MARS looks excellent at the batch sizes used in the original paper and
brittle elsewhere.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="5-distributed-training-diloco-and-its-descendants"&gt;
 5. Distributed Training: DiLoCo and Its Descendants&lt;span class="heading__anchor"&gt; &lt;a href="#5-distributed-training-diloco-and-its-descendants"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;DiLoCo (Distributed Low-Communication training) uses AdamW as an &lt;em&gt;inner&lt;/em&gt;
optimizer for $H$ local steps on each worker (typically $H = 500$), then
synchronises by applying Nesterov momentum to the &lt;strong&gt;pseudo-gradient&lt;/strong&gt;, the
net parameter change accumulated over those inner steps, averaged across
workers. Synchronising once every $H$ steps instead of every step cuts
communication frequency by a factor of $H$, i.e. 500× at $H = 500$.&lt;/p&gt;
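&lt;p&gt;The outer step can be sketched on a toy problem. This is a single-process illustration under stated assumptions, not the reference implementation: plain SGD stands in for AdamW as the inner optimizer, each worker minimises a hypothetical quadratic, and the outer learning rate of 0.7 and momentum of 0.9 are illustrative values:&lt;/p&gt;

```python
def inner_steps(theta, target, lr, num_steps):
    """Run plain SGD on 0.5 * (theta - target)^2 (a stand-in for AdamW)."""
    for _ in range(num_steps):
        theta = [p - lr * (p - t) for p, t in zip(theta, target)]
    return theta


def diloco_round(theta, momentum_buf, worker_targets, inner_lr=0.1,
                 num_inner_steps=50, outer_lr=0.7, mu=0.9):
    """One outer step: average the workers' net changes, then apply Nesterov."""
    local = [inner_steps(list(theta), t, inner_lr, num_inner_steps)
             for t in worker_targets]
    n = len(local)
    # Pseudo-gradient: global parameters minus the average of the worker results.
    pseudo_grad = [p - sum(w[i] for w in local) / n for i, p in enumerate(theta)]
    # Nesterov momentum in the same form as torch.optim.SGD(nesterov=True).
    momentum_buf = [mu * b + g for b, g in zip(momentum_buf, pseudo_grad)]
    theta = [p - outer_lr * (g + mu * b)
             for p, g, b in zip(theta, pseudo_grad, momentum_buf)]
    return theta, momentum_buf


worker_targets = [[1.0, -2.0], [3.0, 0.0]]   # hypothetical per-worker optima
theta, buf = [0.0, 0.0], [0.0, 0.0]
for _ in range(40):
    theta, buf = diloco_round(theta, buf, worker_targets)

print([round(p, 4) for p in theta])  # settles at the mean optimum [2.0, -1.0]
```

&lt;p&gt;Only the averaged pseudo-gradient crosses the network, once per round, which is where the communication saving comes from.&lt;/p&gt;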
&lt;h3 class="heading" id="opendiloco-the-open-source-foundation"&gt;
 OpenDiLoCo: the open-source foundation&lt;span class="heading__anchor"&gt; &lt;a href="#opendiloco-the-open-source-foundation"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;PrimeIntellect&amp;rsquo;s
&lt;a href="https://github.com/PrimeIntellect-ai/OpenDiloco"&gt;OpenDiLoCo&lt;/a&gt; provides a
reproducible drop-in implementation, demonstrated training across two continents
and three countries with 90–95% compute utilisation. It later served as the
foundation for INTELLECT-1, a 10B-parameter model trained globally.&lt;/p&gt;

&lt;figure class="code-block"&gt;
 
 &lt;div class="highlight-wrapper"&gt;
 &lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;open_diloco.hivemind_diloco&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DiLoCoOptimizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;inner_optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;outer_optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DiLoCoOptimizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dht&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dht&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_inner_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# sync every 500 steps, 500× fewer communications&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;inner_optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inner_optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outer_optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;outer_optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
 &lt;/div&gt;
&lt;/figure&gt;&lt;h3 class="heading" id="why-diloco-works-on-a-single-node-snoo"&gt;
 Why DiLoCo works on a single node: SNOO&lt;span class="heading__anchor"&gt; &lt;a href="#why-diloco-works-on-a-single-node-snoo"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Kallusky et al. (2025)&lt;/strong&gt; show in &lt;em&gt;SNOO: Step-K Nesterov Outer Optimizer&lt;/em&gt; that
DiLoCo&amp;rsquo;s effectiveness, even on a single node, stems from applying &lt;strong&gt;Nesterov
momentum to the pseudo-gradient&lt;/strong&gt;. Their method isolates this as a standalone
Lookahead variant. Results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1.5–2.5× compute-efficiency gains&lt;/strong&gt; at scales up to $10^{23}$ training FLOPs.&lt;/li&gt;
&lt;li&gt;Improvements &lt;em&gt;increase&lt;/em&gt; with model size.&lt;/li&gt;
&lt;li&gt;Compatible with both AdamW and Muon as inner optimizers.&lt;/li&gt;
&lt;li&gt;Minimal memory overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Single-worker DiLoCo achieves speedups of up to &lt;strong&gt;6.32%&lt;/strong&gt; in steps-to-loss
over AdamW on a 160M Llama model.&lt;/p&gt;
&lt;h3 class="heading" id="smoothing-diloco-generalized-primal-averaging-gpa"&gt;
 Smoothing DiLoCo: Generalized Primal Averaging (GPA)&lt;span class="heading__anchor"&gt; &lt;a href="#smoothing-diloco-generalized-primal-averaging-gpa"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Defazio et al. (2025/2026)&lt;/strong&gt; propose &lt;strong&gt;GPA&lt;/strong&gt; in &lt;em&gt;Smoothing DiLoCo with Primal
Averaging for Faster Training of LLMs&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2512.17131"&gt;arXiv:2512.17131&lt;/a&gt;), which decouples
DiLoCo&amp;rsquo;s interpolation constants to enable smooth iterate averaging at every
step, replacing uniform averaging with exponential moving averaging.&lt;/p&gt;
&lt;p&gt;GPA unifies single-worker DiLoCo and ScheduleFree within a single non-distributed
framework. Speedups over AdamW in steps-to-target-loss:&lt;/p&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Model&lt;/th&gt;
					&lt;th style="text-align: right"&gt;Speedup&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-160M&lt;/td&gt;
					&lt;td style="text-align: right"&gt;8.71%&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-1B&lt;/td&gt;
					&lt;td style="text-align: right"&gt;10.13%&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Llama-8B&lt;/td&gt;
					&lt;td style="text-align: right"&gt;9.58%&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 class="heading" id="streaming-diloco-towards-free-distributed-training"&gt;
 Streaming DiLoCo: towards free distributed training&lt;span class="heading__anchor"&gt; &lt;a href="#streaming-diloco-towards-free-distributed-training"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Douillard et al. (2025)&lt;/strong&gt; address the remaining bottleneck in &lt;em&gt;Streaming
DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch&lt;/em&gt;
(&lt;a href="https://arxiv.org/abs/2501.18512"&gt;arXiv:2501.18512&lt;/a&gt;): even with infrequent
synchronisation, each sync exchanges all parameters simultaneously. Three fixes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Streaming sync:&lt;/strong&gt; synchronise only subsets of parameters at a time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Overlapping communication:&lt;/strong&gt; continue training during synchronisation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantisation:&lt;/strong&gt; reduce cross-worker data to fewer bits.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Together, these changes cut the required bandwidth by &lt;strong&gt;two orders of
magnitude&lt;/strong&gt; while maintaining comparable quality at billion-parameter scale.&lt;/p&gt;
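&lt;p&gt;One way to picture streaming sync is a staggered schedule in which each parameter fragment still synchronises once every $H$ steps, but at a different offset. The sketch below illustrates that idea only. The even spacing, the fragment count, and the function name &lt;code&gt;fragments_due&lt;/code&gt; are assumptions rather than the paper&amp;rsquo;s exact schedule:&lt;/p&gt;

```python
def fragments_due(step, num_fragments, sync_every):
    """Return the fragment indices scheduled to synchronise at this step."""
    stride = sync_every // num_fragments   # offset between consecutive fragments
    return [f for f in range(num_fragments)
            if step % sync_every == (f * stride) % sync_every]


sync_every = 100      # H: each fragment still syncs once per 100 steps
num_fragments = 4     # hypothetical split of the model into 4 fragments

schedule = {s: fragments_due(s, num_fragments, sync_every)
            for s in range(sync_every)
            if fragments_due(s, num_fragments, sync_every)}
print(schedule)  # one fragment at each of steps 0, 25, 50 and 75
```

&lt;p&gt;With four fragments, each sync moves a quarter of the parameters, so peak bandwidth drops even though the total volume per $H$ steps is unchanged.&lt;/p&gt;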
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th style="text-align: left"&gt;Method&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Setting&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Key contribution&lt;/th&gt;
					&lt;th style="text-align: left"&gt;Gain&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;SNOO&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Single-node&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Nesterov momentum on pseudo-gradient&lt;/td&gt;
					&lt;td style="text-align: left"&gt;1.5–2.5× FLOP efficiency&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;GPA&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Single-node&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Smooth iterate averaging; unifies DiLoCo + SF&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~9–10% steps-to-loss speedup&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td style="text-align: left"&gt;Streaming DiLoCo&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Distributed&lt;/td&gt;
					&lt;td style="text-align: left"&gt;Streaming sync + overlapped communication + quantisation&lt;/td&gt;
					&lt;td style="text-align: left"&gt;~100× bandwidth reduction&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="6-cross-cutting-themes-and-open-questions"&gt;
 6. Cross-Cutting Themes and Open Questions&lt;span class="heading__anchor"&gt; &lt;a href="#6-cross-cutting-themes-and-open-questions"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Several recurrent tensions emerge from reading these papers together.&lt;/p&gt;
&lt;h3 class="heading" id="geometry-vs-step-size-calibration-in-muon"&gt;
 Geometry vs. step-size calibration in Muon&lt;span class="heading__anchor"&gt; &lt;a href="#geometry-vs-step-size-calibration-in-muon"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Kovalev, Pethick et al., and Amsel et al. offer geometric explanations for
Muon&amp;rsquo;s success. Shumaylov et al. argue that geometry is practically irrelevant
and step-size optimality is the true driver. Which narrative guides future
research matters: geometry points toward more sophisticated matrix norms; the
step-size interpretation suggests much simpler paths to similar gains.&lt;/p&gt;
&lt;h3 class="heading" id="what-µp-is-actually-doing"&gt;
 What µP is actually doing&lt;span class="heading__anchor"&gt; &lt;a href="#what-%c2%b5p-is-actually-doing"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Kosson et al. argue µP is primarily an implicit warmup mechanism. Kalra &amp;amp;
Barkeshli argue it is essentially about the embedding layer LR. Both stand in
contrast to µP&amp;rsquo;s original geometric motivation. The practical stakes are high:
the warmup interpretation suggests µP can be discarded with a schedule change;
the embedding LR interpretation suggests a single-line fix.&lt;/p&gt;
&lt;h3 class="heading" id="weight-decay-as-a-multi-role-hyperparameter"&gt;
 Weight decay as a multi-role hyperparameter&lt;span class="heading__anchor"&gt; &lt;a href="#weight-decay-as-a-multi-role-hyperparameter"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Weight decay appears as a protagonist in three independent stories in this
survey:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Defazio:&lt;/strong&gt; source of end-of-training gradient spikes via interaction with
normalisation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kosson et al.:&lt;/strong&gt; the true driver of LR transfer, not µP geometry.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kalra &amp;amp; Barkeshli:&lt;/strong&gt; improves scaling law fits but &lt;em&gt;hurts&lt;/em&gt; extrapolation
robustness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is no longer tenable to treat weight decay as a simple regulariser with a
sensible default. It has to be set per layer, with attention to how it interacts
with your normalisation strategy.&lt;/p&gt;
&lt;h3 class="heading" id="diloco-as-the-practical-distributed-optimizer"&gt;
 DiLoCo as the practical distributed optimizer&lt;span class="heading__anchor"&gt; &lt;a href="#diloco-as-the-practical-distributed-optimizer"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;Despite a large body of research on distributed optimizers, DiLoCo and its
derivatives appear to be the only methods that consistently add value beyond
simply scaling the batch size. The finding that its benefits carry over to
single-node settings (via SNOO and GPA) makes it a particularly important line
of work for practitioners at all scales.&lt;/p&gt;
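&lt;p&gt;The single-worker pattern can be sketched on a toy problem. This is a minimal
illustration of the general structure (a few inner steps, then one outer step on
the resulting displacement), not the exact SNOO or GPA algorithm. The helper names,
the scalar quadratic objective, the choice of a Nesterov-momentum outer step, and
all hyperparameter values are assumptions made for the example.&lt;/p&gt;

```python
def inner_sgd(theta, grad_fn, inner_lr, num_inner_steps):
    """Run a few plain SGD steps from theta and return the resulting iterate."""
    for _ in range(num_inner_steps):
        theta = theta - inner_lr * grad_fn(theta)
    return theta


def outer_nesterov_step(theta, theta_inner, buf, outer_lr, outer_momentum):
    """Apply one Nesterov-momentum step to the pseudo-gradient theta - theta_inner."""
    pseudo_grad = theta - theta_inner            # displacement produced by the inner steps
    buf = outer_momentum * buf + pseudo_grad     # momentum buffer update
    update = pseudo_grad + outer_momentum * buf  # Nesterov look-ahead
    return theta - outer_lr * update, buf


def grad_fn(x):
    # Gradient of the toy objective 0.5 * (x - 3)^2 (hypothetical stand-in for a loss).
    return x - 3.0


theta, buf = 0.0, 0.0
for _ in range(100):
    theta_inner = inner_sgd(theta, grad_fn, inner_lr=0.1, num_inner_steps=5)
    theta, buf = outer_nesterov_step(theta, theta_inner, buf,
                                     outer_lr=0.7, outer_momentum=0.9)
print(theta)
```

&lt;p&gt;In a real run the inner loop would be your existing optimizer and
&lt;code&gt;theta&lt;/code&gt; the full parameter vector. The outer step only needs a copy of
the weights from the start of each round, which is why no architectural change is
involved.&lt;/p&gt;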
&lt;hr&gt;
&lt;h2 class="heading" id="practical-recommendations-for-2026"&gt;
 Practical Recommendations for 2026&lt;span class="heading__anchor"&gt; &lt;a href="#practical-recommendations-for-2026"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Where the evidence across these papers converges, it suggests the following
for a new large training run:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Optimizer:&lt;/strong&gt; Muon for hidden-layer matrix weights + AdamW for
embeddings/head. The Moonlight scaling fixes (weight decay + update scale
adjustment) are necessary above ~1B parameters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule:&lt;/strong&gt; ScheduleFree+ or linear decay instead of cosine. If you need a
fixed-horizon schedule, use WSD with a higher $\beta_2$ during cooldown.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weight decay:&lt;/strong&gt; Disable it for layers directly followed by normalisation to
avoid end-of-training gradient spikes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Outer optimizer:&lt;/strong&gt; Wrap your training loop with single-worker DiLoCo (SNOO
or GPA) for a ~9% efficiency gain with no architectural changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;µP alternatives:&lt;/strong&gt; Before taking on the full overhead of µP, try increasing the
embedding layer LR by a factor of $d_{\text{model}} / d_{\text{proxy}}$.
This may reproduce most of the benefit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;None of these require fundamental architectural changes.&lt;/p&gt;
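&lt;p&gt;Items 1, 3, and 5 amount to a routing rule over parameters, which can be
sketched in plain Python. This is an illustration under stated assumptions, not
code from any of the cited papers: the &lt;code&gt;ParamInfo&lt;/code&gt; record, the
&lt;code&gt;build_param_groups&lt;/code&gt; helper, the role labels, the parameter names, and
the numeric values are all hypothetical. Using one shared &lt;code&gt;base_lr&lt;/code&gt; for
both optimizers is also a simplification.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ParamInfo:
    """Hypothetical description of one parameter tensor."""
    name: str
    role: str               # "embedding", "head", or "hidden" (assumed labels)
    ndim: int               # 2 for weight matrices, 1 for biases and gains
    followed_by_norm: bool  # True if a normalisation layer directly follows


def build_param_groups(params, base_lr, weight_decay, d_model, d_proxy):
    """Return one optimizer-config dict per parameter (illustrative rules only)."""
    groups = []
    for p in params:
        if p.role == "hidden" and p.ndim == 2:
            optimizer, lr = "muon", base_lr   # item 1: hidden-layer matrix weights
        elif p.role == "embedding":
            # item 5: embedding LR scaled by d_model / d_proxy
            optimizer, lr = "adamw", base_lr * d_model / d_proxy
        else:
            optimizer, lr = "adamw", base_lr  # head, biases, gains
        # item 3: no weight decay when normalisation directly follows.
        # Leaving 1-D parameters undecayed is a common convention, assumed here.
        no_decay = p.followed_by_norm or p.ndim == 1
        wd = 0.0 if no_decay else weight_decay
        groups.append({"name": p.name, "optimizer": optimizer,
                       "lr": lr, "weight_decay": wd})
    return groups


params = [
    ParamInfo("embed.weight", "embedding", 2, False),
    ParamInfo("block0.proj.weight", "hidden", 2, True),
    ParamInfo("block0.norm.gain", "hidden", 1, False),
    ParamInfo("lm_head.weight", "head", 2, False),
]
groups = build_param_groups(params, base_lr=0.01, weight_decay=0.1,
                            d_model=2048, d_proxy=256)
for g in groups:
    print(g)
```

&lt;p&gt;The resulting dicts would still need to be mapped onto whatever optimizer
implementations you use. The point of the sketch is only that each recommendation
is a per-parameter configuration choice.&lt;/p&gt;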
&lt;hr&gt;
&lt;h2 class="heading" id="references"&gt;
 References&lt;span class="heading__anchor"&gt; &lt;a href="#references"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th&gt;#&lt;/th&gt;
					&lt;th&gt;Paper&lt;/th&gt;
					&lt;th&gt;Venue&lt;/th&gt;
					&lt;th&gt;Links&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td&gt;1&lt;/td&gt;
					&lt;td&gt;Jordan et al. (2024): &lt;em&gt;Muon: An optimizer for hidden layers&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://kellerjordan.github.io/posts/muon/"&gt;blog&lt;/a&gt; · &lt;a href="https://github.com/KellerJordan/Muon"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;2&lt;/td&gt;
					&lt;td&gt;Liu et al. (2025): &lt;em&gt;Muon is Scalable for LLM Training&lt;/em&gt; (Moonlight)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2502.16982"&gt;arXiv:2502.16982&lt;/a&gt; · &lt;a href="https://github.com/MoonshotAI/Moonlight"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;3&lt;/td&gt;
					&lt;td&gt;Kovalev (2025): &lt;em&gt;Understanding Gradient Orthogonalization&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;4&lt;/td&gt;
					&lt;td&gt;Pethick et al. (2025): &lt;em&gt;Training Deep Learning Models with Norm-Constrained LMOs&lt;/em&gt; (Scion)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2502.07529"&gt;arXiv:2502.07529&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;5&lt;/td&gt;
					&lt;td&gt;Amsel et al. (2025): &lt;em&gt;The Polar Express&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;6&lt;/td&gt;
					&lt;td&gt;Shumaylov et al. (2026): &lt;em&gt;Muon is Not That Special&lt;/em&gt; (Freon/Kaon)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;7&lt;/td&gt;
					&lt;td&gt;Defazio et al. (2023): &lt;em&gt;Optimal Linear Decay Learning Rate Schedules&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2310.07831"&gt;arXiv:2310.07831&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;8&lt;/td&gt;
					&lt;td&gt;Dremov et al. (2025): &lt;em&gt;Training Dynamics of the Cooldown Stage in WSD&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;9&lt;/td&gt;
					&lt;td&gt;Schaipp et al. (2025): &lt;em&gt;Surprising Agreement Between Convex Theory and LR Scheduling&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;10&lt;/td&gt;
					&lt;td&gt;Meterez et al. (2026): &lt;em&gt;Anytime Pretraining&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2602.03702"&gt;arXiv:2602.03702&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;11&lt;/td&gt;
					&lt;td&gt;Defazio (2026): &lt;em&gt;ScheduleFree+&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2605.19095"&gt;arXiv:2605.19095&lt;/a&gt; · &lt;a href="https://github.com/facebookresearch/schedule_free"&gt;GitHub&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;12&lt;/td&gt;
					&lt;td&gt;Kosson et al. (2026): &lt;em&gt;Weight Decay May Matter More than µP&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;ICLR 2026&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;13&lt;/td&gt;
					&lt;td&gt;Kalra &amp;amp; Barkeshli (2026): &lt;em&gt;Quantifying HP Transfer and Embedding LR&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;14&lt;/td&gt;
					&lt;td&gt;Defazio (2025): &lt;em&gt;Why Gradients Rapidly Increase Near End of Training&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;15&lt;/td&gt;
					&lt;td&gt;Fu et al. (2025): &lt;em&gt;Nemotron-Flash&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;NeurIPS 2025&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;16&lt;/td&gt;
					&lt;td&gt;Yuan et al. (2025): &lt;em&gt;MARS&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;17&lt;/td&gt;
					&lt;td&gt;Semenov et al. (2025): &lt;em&gt;Benchmarking Optimizers for LLM Pretraining&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;18&lt;/td&gt;
					&lt;td&gt;Kallusky et al. (2025): &lt;em&gt;SNOO&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;19&lt;/td&gt;
					&lt;td&gt;Defazio et al. (2026): &lt;em&gt;Smoothing DiLoCo with Primal Averaging (GPA)&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2512.17131"&gt;arXiv:2512.17131&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;20&lt;/td&gt;
					&lt;td&gt;Douillard et al. (2025): &lt;em&gt;Streaming DiLoCo&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2501.18512"&gt;arXiv:2501.18512&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;21&lt;/td&gt;
					&lt;td&gt;Douillard et al. (2023/2024): &lt;em&gt;DiLoCo&lt;/em&gt; (original)&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://arxiv.org/abs/2311.08105"&gt;arXiv:2311.08105&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;22&lt;/td&gt;
					&lt;td&gt;PrimeIntellect AI (2024): &lt;em&gt;OpenDiLoCo&lt;/em&gt;&lt;/td&gt;
					&lt;td&gt;n/a&lt;/td&gt;
					&lt;td&gt;&lt;a href="https://github.com/PrimeIntellect-ai/OpenDiloco"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.primeintellect.ai/blog/opendiloco"&gt;blog&lt;/a&gt;&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;</description></item><item><title>Optimization Papers in JMLR Volume 26</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v26/</link><pubDate>Sun, 29 Sep 2024 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v26/</guid><description/></item><item><title>Optimization Research Papers in JMLR Volume 25</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v25/</link><pubDate>Sun, 29 Sep 2024 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v25/</guid><description>&lt;h1 class="heading" id="optimization-research-papers-in-jmlr-volume-25-2024"&gt;
 Optimization Research Papers in JMLR Volume 25 (2024)&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-research-papers-in-jmlr-volume-25-2024"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;p&gt;This document lists papers from JMLR Volume 25 (2024) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.&lt;/p&gt;
&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing convex optimization problems, including sparse NMF, differential privacy, and sparse regression.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuze Han, Guangzeng Xie, Zhihua Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates lower complexity bounds for finite-sum optimization problems in convex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse NMF with Archetypal Regularization: Computational and Robustness Properties&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kayhan Behdin, Rahul Mazumder&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes sparse non-negative matrix factorization with archetypal regularization using convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scaling the Convex Barrier with Sparse Dual Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops sparse dual algorithms that make a tighter convex relaxation for neural network verification scale to larger networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Rates in Differentially Private Stochastic Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jinyan Su, Lijie Hu, Di Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes faster convergence rates for differentially private stochastic convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Estimation of Sparse Gaussian Graphical Models with Hidden Clustering Structure&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Meixia Lin, Defeng Sun, Kim-Chuan Toh, Chengjing Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops convex optimization methods for sparse Gaussian graphical models with hidden clustering.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Minimax Optimal Approach to High-Dimensional Double Sparse Linear Regression&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yanhang Zhang, Zhifan Li, Shixiang Liu, Jianxin Yin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a minimax optimal approach for high-dimensional double sparse linear regression using convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Inexact Projected Regularized Newton Method for Fused Zero-Norms Regularization Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuqia Wu, Shaohua Pan, Xiaoqi Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces an inexact projected regularized Newton method for fused zero-norms regularization problems, a nonconvex and nonsmooth problem class.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="nonconvex-optimization"&gt;
 Nonconvex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#nonconvex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers tackling nonconvex optimization, focusing on ADMM, Adam-family methods, and stochastic minimax optimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convergence for Nonconvex ADMM, with Applications to CT Imaging&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rina Foygel Barber, Emil Y. Sidky&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies convergence properties of nonconvex ADMM with applications to CT imaging.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adam-Family Methods for Nonsmooth Optimization with Convergence Guarantees&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Adam-family methods for nonsmooth nonconvex optimization with convergence guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nonasymptotic Analysis of Stochastic Gradient Hamiltonian Monte Carlo under Local Conditions for Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: O. Deniz Akyildiz, Sotirios Sabanis&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a nonasymptotic analysis of stochastic gradient Hamiltonian Monte Carlo for nonconvex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;High Probability Convergence Bounds for Non-Convex Stochastic Gradient Descent with Sub-Weibull Noise&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Liam Madden, Emiliano Dall&amp;rsquo;Anese, Stephen Becker&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Derives high-probability convergence bounds for nonconvex stochastic gradient descent with sub-Weibull noise.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Regularized Majorization-Minimization with Weakly Convex and Multi-Convex Surrogates&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hanbaek Lyu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes stochastic regularized majorization-minimization for weakly convex and multi-convex problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Near-Optimal Algorithms for Stochastic Minimax Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lesi Chen, Luo Luo&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops near-optimal algorithms for stochastic minimax optimization in nonconvex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Naoki Sato, Koshiro Izumi, Hideaki Iiduka&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a scaled conjugate gradient method for nonconvex optimization in deep neural networks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="stochastic-optimization"&gt;
 Stochastic Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on stochastic optimization methods, including continuous-time approximations, momentum, and curvature estimates.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Stefan Ankirchner, Stefan Perko&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Compares continuous-time approximations to stochastic gradient descent for optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Generalization of Stochastic Gradient Descent with Momentum&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the generalization properties of stochastic gradient descent with momentum.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies stochastic modified flows and mean-field limits for stochastic gradient descent dynamics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates stochastic approximation with decision-dependent distributions, focusing on asymptotic normality and optimality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Guy Kornowski, Ohad Shamir&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes an algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Hyperparameters in Stochastic Gradient Descent with Momentum&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Bin Shi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Examines the impact of hyperparameters in stochastic gradient descent with momentum.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Almost Sure Convergence Rates Analysis and Saddle Avoidance of Stochastic Gradient Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jun Liu, Ye Yuan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes almost sure convergence rates and saddle avoidance in stochastic gradient methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zachary Frangella, Pratik Rathore, Shipu Zhao, Madeleine Udell&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces preconditioned stochastic optimization methods with scalable curvature estimates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Zeroth-Order Stochastic Approximation Algorithms for DR-Submodular Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuefang Lian, Xiao Wang, Dachuan Xu, Zhongrui Zhao&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops zeroth-order stochastic approximation algorithms for DR-submodular optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic-Constrained Stochastic Optimization with Markovian Data&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yeongjong Kim, Dabeen Lee&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies stochastic-constrained optimization with Markovian data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;High Probability and Risk-Averse Guarantees for a Stochastic Accelerated Primal-Dual Method&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yassine Laguel, Necdet Serhat Aybat, Mert Gürbüzbalaban&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides high-probability and risk-averse guarantees for a stochastic accelerated primal-dual method.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="distributeddecentralized-optimization"&gt;
 Distributed/Decentralized Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#distributeddecentralized-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and federated learning.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: T. Tony Cai, Hongji Wei&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimal rates and communication-efficient algorithms for distributed Gaussian mean estimation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated Gradient Tracking over Time-Varying Graphs for Decentralized Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Huan Li, Zhouchen Lin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes accelerated gradient tracking for decentralized optimization over time-varying graphs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compressed and Distributed Least-Squares Regression: Convergence Rates with Applications to Federated Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Constantin Philippenko, Aymeric Dieuleveut&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence rates for compressed and distributed least-squares regression in federated learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Federated Automatic Differentiation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Keith Rush, Zachary Charles, Zachary Garrett&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces federated automatic differentiation for distributed optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Random Projection Approach to Personalized Federated Learning: Enhancing Communication Efficiency, Robustness, and Fairness&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuze Han, Xiang Li, Shiyun Lin, Zhihua Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a random projection approach to enhance communication efficiency in personalized federated learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Countering the Communication Bottleneck in Federated Learning: A Highly Efficient Zero-Order Optimization Technique&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Elissa Mhanna, Mohamad Assaad&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a zero-order optimization technique to address communication bottlenecks in federated learning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bandits-and-online-learning"&gt;
 Bandits and Online Learning&lt;span class="heading__anchor"&gt; &lt;a href="#bandits-and-online-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing multi-armed bandits, online optimization, and regret minimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zixian Yang, Xin Liu, Lei Ying&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies exploration, exploitation, and engagement in multi-armed bandits with abandonment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptivity and Non-Stationarity: Problem-Dependent Dynamic Regret for Online Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes problem-dependent dynamic regret for online convex optimization under non-stationarity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Materials Discovery Using Max K-Armed Bandit&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nobuaki Kikkawa, Hiroshi Ohno&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies max k-armed bandit algorithms to materials discovery, focusing on regret minimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Finite-Time Analysis of Globally Nonstationary Multi-Armed Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Junpei Komiyama, Edouard Fouché, Junya Honda&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides finite-time analysis for globally nonstationary multi-armed bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Sijia Chen, Yu-Jie Zhang, Wei-Wei Tu, Peng Zhao, Lijun Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimistic online mirror descent for bridging stochastic and adversarial online convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Continuous Prediction with Experts&amp;rsquo; Advice&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nicholas J. A. Harvey, Christopher Liaw, Victor S. Portella&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates continuous prediction with experts&amp;rsquo; advice in online learning settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Regret Analysis of Bilateral Trade with a Smoothed Adversary&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, Stefano Leonardi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes regret in bilateral trade with a smoothed adversary in online optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimal Learning Policies for Differential Privacy in Multi-Armed Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Siwei Wang, Jun Zhu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimal learning policies for differential privacy in multi-armed bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Information Capacity Regret Bounds for Bandits with Mediator Feedback&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Khaled Eldowa, Nicolò Cesa-Bianchi, Alberto Maria Metelli, Marcello Restelli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Derives regret bounds for bandits with mediator feedback, focusing on information capacity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aleksandrs Slivkins, Xingyu Zhou, Karthik Abinav Sankararaman, Dylan J. Foster&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a modular Lagrangian approach for contextual bandits with packing and covering constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="optimization-in-reinforcement-learning"&gt;
 Optimization in Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-in-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on optimization techniques for reinforcement learning, including policy gradient, actor-critic, and safe RL.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Shicong Cen, Yuting Wei, Yuejie Chi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops fast policy extragradient methods for competitive games with entropy regularization in RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sample-Efficient Adversarial Imitation Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dahuin Jung, Hyungyu Lee, Sungroh Yoon&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes sample-efficient adversarial imitation learning methods for RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Sample Complexity and Metastability of Heavy-Tailed Policy Search in Continuous Control&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes sample complexity and metastability for heavy-tailed policy search in continuous control.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops off-policy action anticipation methods for multi-agent RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Policy Gradient Methods in the Presence of Symmetries and State Abstractions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates policy gradient methods with symmetries and state abstractions for RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Log Barriers for Safe Black-Box Optimization with Application to Safe Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ilnura Usmanova, Yarden As, Maryam Kamgarpour, Andreas Krause&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes log barriers for safe black-box optimization with applications to safe RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jinchi Chen, Jie Feng, Weiguo Gao, Ke Wei&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops decentralized natural policy gradient with variance reduction for multi-agent RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Laixi Shi, Yuejie Chi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies distributionally robust model-based offline RL with near-optimal sample complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes sample complexity of neural policy mirror descent for policy optimization on low-dimensional manifolds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mean-Field Approximation of Cooperative Constrained Multi-Agent Reinforcement Learning (CMARL)&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes mean-field approximations for cooperative constrained multi-agent RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Luofeng Liao, Zuyue Fu, Zhuoran Yang, Yixin Wang, Dingli Ma, Mladen Kolar, Zhaoran Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops instrumental variable value iteration for causal offline RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: François G. Ged, Maria Han Veiga&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a Matryoshka policy gradient method for entropy-regularized RL with convergence guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data-Efficient Policy Evaluation Through Behavior Policy Search&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Josiah P. Hanna, Yash Chandak, Philip S. Thomas, Martha White, Peter Stone, Scott Niekum&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes data-efficient policy evaluation methods for RL through behavior policy search.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Empirical Design in Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Andrew Patterson, Samuel Neumann, Martha White, Adam White&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Discusses how to design, run, and report sound empirical experiments in reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A New, Physics-Informed Continuous-Time Reinforcement Learning Algorithm with Performance Guarantees&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Brent A. Wallace, Jennie Si&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a physics-informed continuous-time RL algorithm with performance guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="other-optimization-topics"&gt;
 Other Optimization Topics&lt;span class="heading__anchor"&gt; &lt;a href="#other-optimization-topics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers covering miscellaneous optimization topics, including optimal transport, bilevel optimization, and tensor recovery.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes efficient and scalable computation methods for nonparametric MLE in mixture models using optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tangential Wasserstein Projections&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Florian Gunsilius, Meng Hsuan Hsieh, Myung Jin Lee&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops tangential Wasserstein projections for optimization in optimal transport.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a weight-decay-integrated Nesterov acceleration method for faster network training.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xuxing Chen, Tesi Xiao, Krishnakumar Balasubramanian&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learning to Warm-Start Fixed-Point Optimization Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rajiv Sambharya, Georgina Hall, Brandon Amos, Bartolomeo Stellato&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes learning-based warm-start techniques for fixed-point optimization algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wasserstein Proximal Coordinate Gradient Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rentian Yao, Xiaohui Chen, Yun Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Wasserstein proximal coordinate gradient algorithms for optimal transport optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Convergence of Projected Alternating Maximization for Equitable and Optimal Transport&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Minhui Huang, Shiqian Ma, Lifeng Lai&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence of projected alternating maximization for equitable and optimal transport.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Complexity Adaptation for Empirical Entropic Optimal Transport&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Michel Groppe, Shayan Hundrieser&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes lower complexity adaptation methods for empirical entropic optimal transport.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerating Nuclear-Norm Regularized Low-Rank Matrix Optimization Through Burer-Monteiro Decomposition&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ching-pei Lee, Ling Liang, Tianyun Tang, Kim-Chuan Toh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces accelerated nuclear-norm regularized low-rank matrix optimization using Burer-Monteiro decomposition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Guaranteed Nonconvex Factorization Approach for Tensor Train Recovery&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhen Qin, Michael B. Wakin, Zhihui Zhu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a guaranteed nonconvex factorization approach for tensor train recovery.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Infeasible Deterministic, Stochastic, and Variance-Reduction Algorithms for Optimization under Orthogonality Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Pierre Ablin, Simon Vary, Bin Gao, Pierre-Antoine Absil&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes algorithms for optimization under orthogonality constraints, including deterministic, stochastic, and variance-reduction methods.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Optimization Research Papers in JMLR Volume 24</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v24/</link><pubDate>Fri, 29 Sep 2023 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v24/</guid><description>&lt;h1 class="heading" id="optimization-research-papers-in-jmlr-volume-24-2023"&gt;
 Optimization Research Papers in JMLR Volume 24 (2023)&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-research-papers-in-jmlr-volume-24-2023"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;p&gt;This document lists papers from JMLR Volume 24 (2023) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.&lt;/p&gt;
&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing convex optimization problems, including sparse PCA, L0 regularization, and matrix decomposition.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse PCA: A Geometric Approach&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dimitris Bertsimas, Driss Lahlou Kitane&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a geometric approach for sparse principal component analysis using convex optimization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fundamental Limits and Algorithms for Sparse Linear Regression with Sublinear Sparsity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lan V. Truong&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates algorithms and theoretical limits for sparse linear regression with sublinear sparsity in a convex framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Training with Lipschitz Continuous Loss Functions and a Weighted Group L0-norm Constraint&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Michael R. Metel&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes sparse training methods using Lipschitz continuous loss functions and group L0-norm constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MARS: A Second-Order Reduction Algorithm for High-Dimensional Sparse Precision Matrices Estimation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Qian Li, Binyan Jiang, Defeng Sun&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Presents a second-order reduction algorithm for sparse precision matrix estimation using convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse GCA and Thresholded Gradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Sheng Gao, Zongming Ma&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a thresholded gradient descent algorithm for sparse generalized correlation analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Parameter-Free Conditional Gradient Method for Composite Minimization under Hölder Condition&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Masaru Ito, Zhaosong Lu, Chuan He&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a parameter-free conditional gradient method for composite minimization under Hölder smoothness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;L0Learn: A Scalable Package for Sparse Learning using L0 Regularization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hussein Hazimeh, Rahul Mazumder, Tim Nonet&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Presents L0Learn, a scalable package for sparse learning with L0 regularization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dimitris Bertsimas, Ryan Cory-Wright, Nicholas A. G. Johnson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a discrete optimization approach for sparse plus low-rank matrix decomposition, using convex relaxations to certify solution quality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributed Sparse Regression via Penalization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yao Ji, Gesualdo Scutari, Ying Sun, Harsha Honnappa&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops distributed sparse regression algorithms using penalization techniques in convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Oskar Allerbo, Johan Jonasson, Rebecka Jörnsten&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces an iterative method approximating elastic net solution paths in convex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Novel Integer Linear Programming Approach for Global L0 Minimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Diego Delle Donne, Matthieu Kowalski, Leo Liberti&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes an integer linear programming approach for global L0 minimization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="nonconvex-optimization"&gt;
 Nonconvex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#nonconvex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers tackling nonconvex optimization, focusing on descent algorithms, majorization minimization, and minimax problems.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Line-Search Descent Algorithm for Strict Saddle Functions with Complexity Guarantees&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Michael J. O&amp;rsquo;Neill, Stephen J. Wright&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a line-search descent algorithm for nonconvex strict saddle functions with complexity guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Le Thi Khanh Hien, Duy Nhat Phan, Nicolas Gillis&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes an inertial block majorization minimization framework for nonsmooth nonconvex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the O(epsilon^(-7/4)) Complexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Huan Li, Zhouchen Lin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a restarted accelerated gradient descent method for nonconvex optimization that removes the polylogarithmic factor from the O(epsilon^(-7/4)) complexity bound.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preconditioned Gradient Descent for Overparameterized Nonconvex Burer-Monteiro Factorization with Global Optimality Certification&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Gavin Zhang, Salar Fattahi, Richard Y. Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops preconditioned gradient descent for nonconvex Burer-Monteiro factorization with global optimality guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Zeroth-Order Alternating Gradient Descent Ascent Algorithms for A Class of Nonconvex-Nonconcave Minimax Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zi Xu, Zi-Qi Wang, Jun-Lin Wang, Yu-Hong Dai&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes zeroth-order alternating gradient descent ascent for nonconvex-nonconcave minimax problems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="stochastic-optimization"&gt;
 Stochastic Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on stochastic optimization methods, including gradient descent, proximal point methods, and continuous-time approaches.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Convergence of Stochastic Gradient Descent with Bandwidth-Based Step Size&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xiaoyu Wang, Ya-xiang Yuan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence of stochastic gradient descent with bandwidth-based step sizes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Optimization under Distributional Drift&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies stochastic optimization under distributional drift with theoretical guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Powered Stochastic Optimization Algorithms for Large-Scale Machine Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhuang Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes improved powered stochastic optimization algorithms for large-scale machine learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xiao-Tong Yuan, Ping Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a sharper analysis of minibatch stochastic proximal point methods, focusing on stability and smoothness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Continuous-Time Stochastic Gradient Descent Method for Continuous Data&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kexin Jin, Jonas Latz, Chenguang Liu, Carola-Bibiane Schönlieb&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a continuous-time stochastic gradient descent method for continuous data optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sensitivity-Free Gradient Descent Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ion Matei, Maksym Zhenirovskyy, Johan de Kleer, John Maxwell&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops sensitivity-free gradient descent algorithms for stochastic optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="distributeddecentralized-optimization"&gt;
 Distributed/Decentralized Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#distributeddecentralized-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing distributed or decentralized optimization algorithms, focusing on federated learning, asynchronous updates, and network topology.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Decentralized Learning: Theoretical Optimality and Practical Improvements&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yucheng Lu, Christopher De Sa&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes theoretical optimality and practical improvements for decentralized learning algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A General Theory for Federated Optimization with Asynchronous and Heterogeneous Clients Updates&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yann Fraboni, Richard Vidal, Laetitia Kameni, Marco Lorenzi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a general theory for federated optimization with asynchronous and heterogeneous client updates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Buffered Asynchronous SGD for Byzantine Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yi-Rui Yang, Wu-Jun Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes buffered asynchronous SGD for Byzantine-resilient distributed learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Minimax Estimation for Personalized Federated Learning: An Alternative Between FedAvg and Local Training&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Shuxiao Chen, Qinqing Zheng, Qi Long, Weijie J. Su&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates minimax estimation for personalized federated learning, comparing FedAvg and local training.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kun Yuan, Sulaiman A. Alghunaim, Xinmeng Huang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Shows that removing the influence of data heterogeneity improves the network-topology dependence of decentralized SGD.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Consensus Decentralized Accelerated Gradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Haishan Ye, Luo Luo, Ziang Zhou, Tong Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops multi-consensus decentralized accelerated gradient descent for distributed optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated Primal-Dual Mirror Dynamics for Centralized and Distributed Constrained Convex Optimization Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: You Zhao, Xiaofeng Liao, Xing He, Mingliang Zhou, Chaojie Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes accelerated primal-dual mirror dynamics for centralized and distributed convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Beyond Spectral Gap: The Role of the Topology in Decentralized Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Thijs Vogels, Hadrien Hendrikx, Martin Jaggi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Examines the role of network topology in decentralized learning optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bandits-and-online-learning"&gt;
 Bandits and Online Learning&lt;span class="heading__anchor"&gt; &lt;a href="#bandits-and-online-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing multi-armed bandits, online optimization, and regret minimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptation to the Range in K-Armed Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hédi Hadiji, Gilles Stoltz&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies adaptation to the range in K-armed bandit problems with regret minimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wenhao Li, Ningyuan Chen, L. Jeff Hong&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes dimension reduction techniques for contextual online learning with nonparametric variable selection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-Stationary Online Learning with Memory and Non-Stochastic Control&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Peng Zhao, Yu-Hu Yan, Yu-Xiang Wang, Zhi-Hua Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates non-stationary online learning with memory and non-stochastic control strategies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online Non-Stochastic Control with Partial Feedback&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yu-Hu Yan, Peng Zhao, Zhi-Hua Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops online non-stochastic control methods with partial feedback for optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yasin Abbasi-Yadkori, András György, Nevena Lazić&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes dynamic regret in non-stationary stochastic bandit problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A PDE Approach for Regret Bounds under Partial Monitoring&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Erhan Bayraktar, Ibrahim Ekren, Xin Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Uses a PDE-based approach to derive regret bounds for partial monitoring in online learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Continuous-in-Time Limit for Bayesian Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuhua Zhu, Zachary Izzo, Lexing Ying&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores the continuous-time limit for Bayesian bandit algorithms with theoretical guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bandit Problems with Fidelity Rewards&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Gábor Lugosi, Ciara Pike-Burke, Pierre-André Savalle&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies bandit problems with fidelity rewards, focusing on regret minimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Linear Partial Monitoring for Sequential Decision Making: Algorithms, Regret Bounds and Applications&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Johannes Kirschner, Tor Lattimore, Andreas Krause&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops algorithms and regret bounds for linear partial monitoring in sequential decision-making.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="optimization-in-reinforcement-learning"&gt;
 Optimization in Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-in-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on optimization techniques for reinforcement learning, including actor-critic methods and constrained RL.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reinforcement Learning for Joint Optimization of Multiple Rewards&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mridul Agarwal, Vaneet Aggarwal&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Focuses on reinforcement learning for optimizing multiple rewards simultaneously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Qinbo Bai, Vaneet Aggarwal, Ather Gattami&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a sample-efficient model-free algorithm for MDPs with peak constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Off-Policy Actor-Critic with Emphatic Weightings&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops off-policy actor-critic methods with emphatic weightings for RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ali Devran Kara, Naci Saldi, Serdar Yüksel&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence and near-optimality of Q-learning for MDPs with general state and action spaces, using quantization under weak continuity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;F2A2: Flexible Fully-Decentralized Approximate Actor-Critic for Cooperative Multi-Agent Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wenhao Li, Bo Jin, Xiangfeng Wang, Junchi Yan, Hongyuan Zha&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a flexible fully-decentralized approximate actor-critic method for cooperative multi-agent RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptation Augmented Model-Based Policy Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jian Shen, Hang Lai, Minghuan Liu, Han Zhao, Yong Yu, Weinan Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces adaptation-augmented model-based policy optimization for RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Single Timescale Actor-Critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mo Zhou, Jianfeng Lu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a single timescale actor-critic method for linear quadratic regulators with convergence guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convex Reinforcement Learning in Finite Trials&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates convex reinforcement learning with finite trials, focusing on optimization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zihao Li, Boyi Liu, Zhuoran Yang, Zhaoran Wang, Mengdi Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a variational primal-dual policy optimization method for constrained RL.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Instance-Dependent Confidence and Early Stopping for Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Eric Xia, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops instance-dependent confidence bounds and early stopping strategies for RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="other-optimization-topics"&gt;
 Other Optimization Topics&lt;span class="heading__anchor"&gt; &lt;a href="#other-optimization-topics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers covering miscellaneous optimization topics, including Riemannian optimization, matrix completion, and optimal transport.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Radu I. Boț, Michael Sedlmayer, Phan Tu Vuong&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a relaxed inertial forward-backward-forward algorithm for monotone inclusions with applications to GANs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Discrete Variational Calculus for Accelerated Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Cédric M. Campos, Alejandro Mahillo, David Martín de Diego&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces discrete variational calculus for accelerating optimization processes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online Optimization over Riemannian Manifolds&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xi Wang, Zhipeng Tu, Yiguang Hong, Yingyi Wu, Guodong Shi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops online optimization algorithms over Riemannian manifolds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast Objective &amp;amp; Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhishuai Guo, Yan Yan, Zhuoning Yuan, Tianbao Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes fast convergence for non-convex strongly-concave min-max problems under the Polyak-Łojasiewicz condition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hamid Reza Feyzmahdavian, Mikael Johansson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides new sequence results and sharper guarantees for asynchronous optimization iterations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Proximal ID Algorithm&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ilya Shpitser, Zach Wood-Doughty, Eric J. Tchetgen Tchetgen&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces the proximal ID algorithm, which combines proxy-based (proximal) causal inference with the ID algorithm to identify causal effects under unobserved confounding.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Inexact Augmented Lagrangian Algorithm for Training Leaky ReLU Neural Network with Group Sparsity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wei Liu, Xin Liu, Xiaojun Chen&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops an inexact augmented Lagrangian algorithm for training leaky ReLU networks with group sparsity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Optimality of Nuclear-Norm-Based Matrix Completion for Problems with Smooth Non-Linear Structure&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yunhua Xiang, Tianyu Zhang, Xu Wang, Ali Shojaie, Noah Simon&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies nuclear-norm-based matrix completion for problems with smooth nonlinear structures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Importance Sparsification for Sinkhorn Algorithm&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mengyu Li, Jun Yu, Tao Li, Cheng Meng&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes importance sparsification techniques for the Sinkhorn algorithm in optimal transport.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Near-Optimal Weighted Matrix Completion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Oscar López&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates near-optimal weighted matrix completion using optimization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implicit Regularization and Entrywise Convergence of Riemannian Optimization for Low Tucker-Rank Tensor Completion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Haifeng Wang, Jinchi Chen, Ke Wei&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes implicit regularization and entrywise convergence in Riemannian optimization for low Tucker-rank tensor completion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Unbalanced Optimal Transport: Gradient Methods, Sparsity and Approximation Error&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Quang Minh Nguyen, Hoang H. Nguyen, Yi Zhou, Lam M. Nguyen&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies gradient methods for unbalanced optimal transport, focusing on sparsity and approximation error.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Optimization Research Papers in JMLR Volume 23</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v23/</link><pubDate>Thu, 29 Sep 2022 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v23/</guid><description>&lt;h1 class="heading" id="optimization-research-papers-in-jmlr-volume-23-2022"&gt;
 Optimization Research Papers in JMLR Volume 23 (2022)&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-research-papers-in-jmlr-volume-23-2022"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;p&gt;This document lists papers from JMLR Volume 23 (2022) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.&lt;/p&gt;
&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing convex optimization problems, including sparse PCA, L1-regularized SVMs, and metric-constrained problems.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dimitris Bertsimas, Ryan Cory-Wright, Jean Pauphilet&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops convex optimization techniques for large-scale sparse principal component analysis with certifiable near-optimal solutions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Novel Min-Max Reformulations of Linear Inverse Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mohammed Rayyan Sheriff, Debasish Chatterjee&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes min-max reformulations for linear inverse problems using convex optimization frameworks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New Insights for the Multivariate Square-Root Lasso&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aaron J. Molstad&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the square-root Lasso in multivariate settings, focusing on its convex optimization properties.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Towards An Efficient Approach for the Nonconvex lp Ball Projection: Algorithm and Analysis&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xiangyu Yang, Jiashan Wang, Hao Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops and analyzes an efficient algorithm for projection onto the nonconvex lp ball.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solving L1-Regularized SVMs and Related Linear Programs: Revisiting the Effectiveness of Column and Constraint Generation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Antoine Dedieu, Rahul Mazumder, Haoyue Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates L1-regularized SVMs using convex optimization with column and constraint generation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extensions to the Proximal Distance Method of Constrained Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alfonso Landeros, Oscar Hernan Madrid Padilla, Hua Zhou, Kenneth Lange&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Extends the proximal distance method for constrained convex optimization problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Subgradient for Composite Convex Optimization with Functional Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ion Necoara, Nitesh Kumar Singh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes stochastic subgradient methods for composite convex optimization with functional constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Regularized Square-Root Regression Problems: Distributionally Robust Interpretation and Fast Computations&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hong T.M. Chu, Kim-Chuan Toh, Yangjing Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies regularized square-root regression with a distributionally robust perspective and efficient computational methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Project and Forget: Solving Large-Scale Metric Constrained Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rishi Sonthalia, Anna C. Gilbert&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a convex optimization approach for large-scale metric-constrained problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Randomized Interior Point Methods for Tall/Wide Linear Programs&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Agniva Chowdhury, Gregory Dexter, Palma London, Haim Avron, Petros Drineas&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops randomized interior point methods for efficient optimization of tall/wide linear programs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="nonconvex-optimization"&gt;
 Nonconvex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#nonconvex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers tackling nonconvex optimization, focusing on optimality, stability, and convergence in nonsmooth and game settings.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimality and Stability in Non-Convex Smooth Games&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Guojun Zhang, Pascal Poupart, Yaoliang Yu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes optimality and stability in nonconvex smooth games with convergence guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhize Li, Jian Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes simple and optimal stochastic gradient methods for nonsmooth, nonconvex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Oracle Complexity in Nonsmooth Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Guy Kornowski, Ohad Shamir&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies the oracle complexity of nonsmooth nonconvex optimization problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates distributed SGD for nonconvex, nonsmooth optimization with convergence to local minima.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="stochastic-optimization"&gt;
 Stochastic Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on stochastic optimization methods, including bundle methods, zeroth-order algorithms, and adaptive techniques.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Stochastic Bundle Method for Interpolation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, M. Pawan Kumar&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a stochastic bundle method for optimization in the interpolation setting, where the model can fit the training data exactly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Biased Stochastic Gradient Estimation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Derek Driggs, Jingwei Liang, Carola-Bibiane Schönlieb&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes biases in stochastic gradient estimation and their impact on optimization performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes accelerated zeroth-order and first-order momentum methods for both minimization and minimax optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Zeroth-Order Optimization under Nonstationarity and Nonconvexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies zeroth-order optimization in nonstationary and nonconvex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerating Adaptive Cubic Regularization of Newton’s Method via Random Sampling&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xi Chen, Bo Jiang, Tianyi Lin, Shuzhong Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Accelerates the adaptive cubic regularization of Newton’s method using random sampling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Momentumized, Adaptive, Dual Averaged Gradient Method&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aaron Defazio, Samy Jelassi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops MADGRAD, an adaptive gradient method that combines momentum with dual averaging for stochastic optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic DCA with Variance Reduction and Applications in Machine Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hoai An Le Thi, Hoang Phuc Hau Luu, Hoai Minh Le, Tao Pham Dinh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a stochastic difference-of-convex-functions algorithm with variance reduction for machine learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alireza Fallah, Mert Gürbüzbalaban, Asuman Ozdaglar, Umut Şimşekli, Lingjiong Zhu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes robust stochastic gradient methods for distributed optimization in multi-agent networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Acceleration for Convex Composite Minimization with Noise-Corrupted Gradients and Approximate Proximal Mapping&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Qiang Zhou, Sinno Jialin Pan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Addresses acceleration in convex composite minimization when gradients are noise-corrupted and the proximal mapping is computed only approximately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asymptotic Study of Stochastic Adaptive Algorithms in Non-Convex Landscape&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Sébastien Gadat, Ioana Gavra&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the asymptotic behavior of stochastic adaptive algorithms in nonconvex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies the Adam optimizer, focusing on nonconvexity, convergence, and mini-batch acceleration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Efficient Sampling Algorithm for Non-Smooth Composite Potentials&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, Peter L. Bartlett&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops an efficient algorithm for sampling from distributions with nonsmooth composite potentials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SGD with Coordinate Sampling: Theory and Practice&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rémi Leluc, François Portier&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores coordinate sampling in stochastic gradient descent with theoretical and practical insights.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="distributeddecentralized-optimization"&gt;
 Distributed/Decentralized Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#distributeddecentralized-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and convergence.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asymptotic Network Independence and Step-Size for a Distributed Subgradient Method&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alex Olshevsky&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes step-size and convergence for a distributed subgradient optimization method.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Projection-Free Distributed Online Learning with Sublinear Communication Complexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuanyu Wan, Guanghui Wang, Wei-Wei Tu, Lijun Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops projection-free algorithms for distributed online learning with reduced communication complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Huan Li, Zhouchen Lin, Yongchun Fang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes variance-reduced methods for decentralized optimization with optimal acceleration.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="submodular-optimization"&gt;
 Submodular Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#submodular-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on submodular optimization, particularly in model selection.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Joint Continuous and Discrete Model Selection via Submodularity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jonathan Bunton, Paulo Tabuada&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Uses submodularity for joint continuous and discrete model selection in optimization.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bandits-and-online-learning"&gt;
 Bandits and Online Learning&lt;span class="heading__anchor"&gt; &lt;a href="#bandits-and-online-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing multi-armed bandits, online optimization, and regret minimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies multi-agent online optimization with delays, focusing on asynchronicity and optimism.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online Mirror Descent and Dual Averaging: Keeping Pace in the Dynamic Case&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Huang Fang, Nicholas J. A. Harvey, Victor S. Portella, Michael P. Friedlander&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes online mirror descent and dual averaging for dynamic online optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No Weighted-Regret Learning in Adversarial Bandits with Delays&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates regret minimization in adversarial bandits with delays.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aurélien Garivier, Hédi Hadiji, Pierre Ménard, Gilles Stoltz&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides optimal regret bounds for stochastic bandits using KL-UCB-Switch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-Agent Multi-Armed Bandits with Limited Communication&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mridul Agarwal, Vaneet Aggarwal, Kamyar Azizzadenesheli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores multi-agent bandits with limited communication, focusing on regret minimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nonstochastic Bandits with Composite Anonymous Feedback&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Claudio Gentile, Yishay Mansour&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies nonstochastic bandits with composite anonymous feedback, with a focus on regret analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Expected Regret and Pseudo-Regret are Equivalent When the Optimal Arm is Unique&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Daron Anderson, Douglas J. Leith&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proves that expected regret and pseudo-regret are equivalent in bandit problems where the optimal arm is unique.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bayesian-and-hyperparameter-optimization"&gt;
 Bayesian and Hyperparameter Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#bayesian-and-hyperparameter-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing Bayesian optimization and hyperparameter tuning for efficient optimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, René Sass, Frank Hutter&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Presents SMAC3, a versatile Bayesian optimization package for hyperparameter tuning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Uses implicit differentiation for efficient hyperparameter selection in nonsmooth convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Auto-Sklearn 2.0: Hands-Free AutoML via Meta-Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, Frank Hutter&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces Auto-Sklearn 2.0, which uses meta-learning to make automated machine learning (AutoML) hands-free.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="optimization-in-reinforcement-learning"&gt;
 Optimization in Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-in-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on optimization techniques for reinforcement learning, including policy gradient and value estimation.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Generalized Projected Bellman Error for Off-Policy Value Estimation in Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Andrew Patterson, Adam White, Martha White&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimization methods for off-policy value estimation using a generalized projected Bellman error.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates greedification operators for policy optimization, focusing on KL divergences.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yanwei Jia, Xun Yu Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes policy gradient and actor-critic methods for continuous-time RL optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Convergence Rates of Policy Gradient Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lin Xiao&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies convergence rates of policy gradient methods in reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor-Critic under State Distribution Mismatch&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Shangtong Zhang, Remi Tachet des Combes, Romain Laroche&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Examines global optimality in softmax off-policy actor-critic methods under distribution mismatch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="other-optimization-topics"&gt;
 Other Optimization Topics&lt;span class="heading__anchor"&gt; &lt;a href="#other-optimization-topics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers covering miscellaneous optimization topics, including proximal algorithms, tensor completion, and learning-to-optimize frameworks.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;TFPnP: Tuning-Free Plug-and-Play Proximal Algorithms with Applications to Inverse Imaging Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Hua Huang, Carola-Bibiane Schönlieb&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces tuning-free proximal algorithms for inverse imaging problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Complexity of Approximating Multimarginal Optimal Transport&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tianyi Lin, Nhat Ho, Marco Cuturi, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the complexity of approximating multimarginal optimal transport problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Riemannian Stochastic Proximal Gradient Methods for Nonsmooth Optimization over the Stiefel Manifold&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Bokun Wang, Shiqian Ma, Lingzhou Xue&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes stochastic proximal gradient methods for nonsmooth optimization on the Stiefel manifold.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provable Tensor-Train Format Tensor Completion by Riemannian Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jian-Feng Cai, Jingyang Li, Dong Xia&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Riemannian optimization for tensor-train format tensor completion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Let’s Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Julie Nutini, Issam Laradji, Mark Schmidt&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Speeds up block coordinate descent through faster greedy rules and message-passing updates, with an analysis of active-set complexity and superlinear convergence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Efficiency of Entropic Regularized Algorithms for Optimal Transport&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tianyi Lin, Nhat Ho, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies entropic regularization for efficient optimal transport algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dachao Lin, Haishan Ye, Zhihua Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides explicit convergence rates for greedy and random quasi-Newton methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tian Tong, Cong Ma, Ashley Prater-Bennette, Erin Tripp, Yuejie Chi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Addresses nonconvex low-rank tensor estimation with provable guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learning to Optimize: A Primer and A Benchmark&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, Wotao Yin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a primer and benchmark for learning-to-optimize techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clustering with Semidefinite Programming and Fixed Point Iteration&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Pedro Felzenszwalb, Caroline Klivans, Alice Paul&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Uses semidefinite programming and fixed-point iteration for clustering optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Bregman Learning Framework for Sparse Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Leon Bungert, Tim Roith, Daniel Tenbrinck, Martin Burger&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a Bregman learning framework for optimizing sparse neural networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yoav Freund, Yi-An Ma, Tong Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes dimension-independent convergence of Langevin algorithms from a composite optimization perspective.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Continuous Distributions and Fenchel-Young Losses&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: André F. T. Martins, Marcos Treviso, António Farinhas, Pedro M. Q. Aguiar, Mário A. T. Figueiredo, Mathieu Blondel, Vlad Niculae&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Extends Fenchel-Young losses and the associated sparse probability distributions to continuous domains.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handling Hard Affine SDP Shape Constraints in RKHSs&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Pierre-Cyril Aubin-Frankowski, Zoltan Szabo&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Addresses how to enforce hard affine SDP shape constraints on functions in reproducing kernel Hilbert spaces.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OMLT: Optimization &amp;amp; Machine Learning Toolkit&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Francesco Ceccon, Jordan Jalving, Joshua Haddad, Alexander Thebelt, Calvin Tsay, Carl D Laird, Ruth Misener&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Presents OMLT, a Python package for embedding trained machine learning models, such as neural networks and gradient-boosted trees, in Pyomo optimization formulations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Optimization Research Papers in JMLR Volume 22</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v22/</link><pubDate>Wed, 29 Sep 2021 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v22/</guid><description>&lt;h1 class="heading" id="optimization-research-papers-in-jmlr-volume-22-2021"&gt;
 Optimization Research Papers in JMLR Volume 22 (2021)&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-research-papers-in-jmlr-volume-22-2021"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;p&gt;This document lists papers from JMLR Volume 22 (2021) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.&lt;/p&gt;
&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing convex optimization problems, including clustering, Wasserstein barycenters, sparse optimization, and bandits.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convex Clustering: Model, Theoretical Guarantee and Efficient Algorithm&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Defeng Sun, Kim-Chuan Toh, Yancheng Yuan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a convex clustering model with theoretical guarantees and an efficient algorithm.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lei Yang, Jia Li, Defeng Sun, Kim-Chuan Toh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a fast, globally linearly convergent algorithm for computing Wasserstein barycenters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wasserstein Barycenters Can Be Computed in Polynomial Time in Fixed Dimension&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jason M. Altschuler, Enric Boix-Adsera&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Demonstrates that Wasserstein barycenters can be computed in polynomial time for fixed dimensions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;From Low Probability to High Confidence in Stochastic Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, Junyu Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes methods to achieve high-confidence solutions in stochastic convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse and Smooth Signal Estimation: Convexification of L0-Formulations&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alper Atamturk, Andres Gomez, Shaoning Han&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes convexification techniques for L0-formulations in sparse and smooth signal estimation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Proximal AUC Maximization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yunwen Lei, Yiming Ying&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops stochastic proximal methods for maximizing the area under the ROC curve (AUC) in convex settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sparse Convex Optimization via Adaptively Regularized Hard Thresholding&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kyriakos Axiotis, Maxim Sviridenko&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces adaptively regularized hard thresholding for sparse convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Antoine Dedieu, Hussein Hazimeh, Rahul Mazumder&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores continuous and mixed-integer optimization approaches for learning sparse classifiers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;First-Order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides first-order convergence theory for weakly convex-weakly concave min-max problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convex Geometry and Duality of Over-parameterized Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tolga Ergen, Mert Pilanci&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convex geometry and duality in over-parameterized neural networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Linear Bandits on Uniformly Convex Sets&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Thomas Kerdreux, Christophe Roux, Alexandre d&amp;rsquo;Aspremont, Sebastian Pokutta&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies linear bandits on uniformly convex sets, focusing on convex optimization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="nonconvex-optimization"&gt;
 Nonconvex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#nonconvex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers tackling nonconvex optimization, including stochastic gradient descent, neural network training, and stability properties.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online Stochastic Gradient Descent on Non-Convex Losses from High-Dimensional Inference&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes online stochastic gradient descent for nonconvex losses in high-dimensional inference.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-attracting Regions of Local Minima in Deep and Wide Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Henning Petzka, Cristian Sminchisescu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates non-attracting regions of local minima in deep and wide neural networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;When Does Gradient Descent with Logistic Loss Find Interpolating Two-Layer Networks?&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Examines conditions under which gradient descent with logistic loss finds interpolating two-layer networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Replica Exchange for Non-Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jing Dong, Xin T. Tong&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes replica exchange methods for nonconvex optimization problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Failures of Model-Dependent Generalization Bounds for Least-Norm Interpolation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Peter L. Bartlett, Philip M. Long&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes limitations of model-dependent generalization bounds in least-norm interpolation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Stability Properties and the Optimization Landscape of Training Problems with Squared Loss for Neural Networks and General Nonlinear Conic Approximation Schemes&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Constantin Christof&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies stability and optimization landscapes for neural network training with squared loss.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="stochastic-optimization"&gt;
 Stochastic Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on stochastic optimization methods, including momentum, Langevin dynamics, and communication-efficient algorithms.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Continuous Time Analysis of Momentum Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nikola B. Kovachki, Andrew M. Stuart&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a continuous-time analysis of momentum methods in stochastic optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yunwen Lei, Ting Hu, Ke Tang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes generalization performance of multi-pass stochastic gradient descent for convex losses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops an accelerated MCMC algorithm using high-order Langevin diffusion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Path Length Bounds for Gradient Descent and Flow&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Chirag Gupta, Sivaraman Balakrishnan, Aaditya Ramdas&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Establishes path length bounds for gradient descent and gradient flow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Michael Muehlebach, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes momentum-based optimization from dynamical, control-theoretic, and symplectic perspectives.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;L-SVRG and L-Katyusha with Arbitrary Sampling&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xun Qian, Zheng Qu, Peter Richtárik&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the loopless variance-reduced methods L-SVRG and L-Katyusha under arbitrary sampling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Lyapunov Analysis of Accelerated Methods in Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ashia C. Wilson, Ben Recht, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a Lyapunov analysis for accelerated optimization methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;NUQSGD: Provably Communication-Efficient Data-Parallel SGD via Nonuniform Quantization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes NUQSGD, a communication-efficient stochastic gradient descent method using nonuniform quantization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Inertial Newton Algorithm for Deep Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops an inertial Newton algorithm for deep learning optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tian Tong, Cong Ma, Yuejie Chi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes scaled gradient descent for accelerating ill-conditioned low-rank matrix estimation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On ADMM in Deep Learning: Convergence and Saturation-Avoidance&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jinshan Zeng, Shao-Bo Lin, Yuan Yao, Ding-Xuan Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence and saturation-avoidance properties of ADMM in deep learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Unified Convergence Analysis for Shuffling-Type Gradient Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, Marten van Dijk&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a unified convergence analysis for shuffling-type gradient methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Online Optimization Using Kalman Recursion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Joseph de Vilmarest, Olivier Wintenberger&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies Kalman recursion to stochastic online optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Expanding Boundaries of Gap Safe Screening&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Cassio F. Dantas, Emmanuel Soubies, Cédric Févotte&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Extends the Gap Safe screening framework to a broader class of optimization problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consensus-Based Optimization on the Sphere: Convergence to Global Minimizers and Machine Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Massimo Fornasier, Lorenzo Pareschi, Hui Huang, Philippe Sünnen&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops consensus-based optimization on the sphere with applications to machine learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mert Gürbüzbalaban, Xuefeng Gao, Yuanhan Hu, Lingjiong Zhu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes decentralized stochastic gradient Langevin dynamics and Hamiltonian Monte Carlo methods.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="distributeddecentralized-optimization"&gt;
 Distributed/Decentralized Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#distributeddecentralized-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and scalability.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Projection-Free Decentralized Online Learning for Submodular Maximization over Time-Varying Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Junlong Zhu, Qingtao Wu, Mingchuan Zhang, Ruijuan Zheng, Keqin Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops projection-free decentralized online learning for submodular maximization over time-varying networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a communication-efficient distributed covariance sketch for distributed PCA.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimal Rates of Distributed Regression with Imperfect Kernels&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hongwei Sun, Qiang Wu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Establishes optimal rates for distributed regression with imperfect kernels.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;One-Shot Federated Learning: Theoretical Limits and Algorithms to Achieve Them&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Saber Salehkaleybar, Arsalan Sharifnassab, S. Jamaloddin Golestani&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes theoretical limits and algorithms for one-shot federated learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jianyu Wang, Gauri Joshi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a unified framework for designing and analyzing local-update SGD algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DeEPCA: Decentralized Exact PCA with Linear Convergence Rate&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Haishan Ye, Tong Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops DeEPCA, a decentralized exact PCA method with linear convergence.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="submodular-optimization"&gt;
 Submodular Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#submodular-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on submodular optimization, particularly in experimental design.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Batch Greedy Maximization of Non-Submodular Functions: Guarantees and Applications to Experimental Design&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jayanth Jagalur-Mohan, Youssef Marzouk&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides guarantees for batch greedy maximization of non-submodular functions with applications to experimental design.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bandits-and-online-learning"&gt;
 Bandits and Online Learning&lt;span class="heading__anchor"&gt; &lt;a href="#bandits-and-online-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing multi-armed bandits, online optimization, and regret minimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Regulating Greed Over Time in Multi-Armed Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Stefano Tracà, Cynthia Rudin, Weiyu Yan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies methods to regulate greed over time in multi-armed bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preference-Based Online Learning with Dueling Bandits: A Survey&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Surveys preference-based online learning with dueling bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Multi-Armed Bandit Designs for Dose-Finding Trials&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Maryam Aziz, Emilie Kaufmann, Marie-Karelle Riviere&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores multi-armed bandit designs for dose-finding trials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Julian Zimmert, Yevgeny Seldin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes Tsallis-INF, an optimal algorithm for stochastic and adversarial bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bandit Convex Optimization in Non-Stationary Environments&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Peng Zhao, Guanghui Wang, Lijun Zhang, Zhi-Hua Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Addresses bandit convex optimization in non-stationary environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Contextual Bandit Bake-off&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alberto Bietti, Alekh Agarwal, John Langford&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Compares contextual bandit algorithms in a comprehensive evaluation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MetaGrad: Adaptation Using Multiple Learning Rates in Online Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Tim van Erven, Wouter M. Koolen, Dirk van der Hoeven&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces MetaGrad, an adaptive online learning algorithm with multiple learning rates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Achieving Fairness in the Stochastic Multi-Armed Bandit Problem&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Vishakha Patil, Ganesh Ghalme, Vineet Nair, Y. Narahari&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops methods for achieving fairness in stochastic multi-armed bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Refined Approachability Algorithms and Application to Regret Minimization with Global Costs&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Joon Kwon&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes refined approachability algorithms for regret minimization with global costs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bandit Learning in Decentralized Matching Markets&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Lydia T. Liu, Feng Ruan, Horia Mania, Michael I. Jordan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies bandit learning to decentralized matching markets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Thompson Sampling Algorithms for Cascading Bandits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Thompson sampling algorithms for cascading bandits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast Learning for Renewal Optimization in Online Task Scheduling&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Michael J. Neely&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes fast learning methods for renewal optimization in online task scheduling.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bayesian-and-hyperparameter-optimization"&gt;
 Bayesian and Hyperparameter Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#bayesian-and-hyperparameter-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing Bayesian optimization and hyperparameter tuning for scalable and robust optimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Empirical Study of Bayesian Optimization: Acquisition Versus Partition&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Erich Merrill, Alan Fern, Xiaoli Fern, Nima Dolatnia&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Conducts an empirical study comparing acquisition and partition strategies in Bayesian optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hyperparameter Optimization via Sequential Uniform Designs&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zebin Yang, Aijun Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes sequential uniform designs for hyperparameter optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Are We Forgetting about Compositional Optimisers in Bayesian Optimisation?&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Antoine Grosnit, Alexander I. Cowen-Rivers, Rasul Tutunov, Ryan-Rhys Griffiths, Jun Wang, Haitham Bou-Ammar&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores the role of compositional optimizers in Bayesian optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GIBBON: General-Purpose Information-Based Bayesian Optimisation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Henry B. Moss, David S. Leslie, Javier Gonzalez, Paul Rayson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces GIBBON, a general-purpose information-based Bayesian optimization framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On lp-Hyperparameter Learning via Bilevel Nonsmooth Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Takayuki Okuno, Akiko Takeda, Akihiro Kawana, Motokazu Watanabe&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies lp-hyperparameter learning using bilevel nonsmooth optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="optimization-in-reinforcement-learning"&gt;
 Optimization in Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-in-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on optimization techniques for reinforcement learning, including policy iteration and Q-learning.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alberto Maria Metelli, Matteo Pirotta, Daniele Calandriello, Marcello Restelli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a safe policy iteration method with monotonic improvement for reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes optimality, approximation error, and distribution shift in policy gradient methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Vikram Krishnamurthy, George Yin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies Langevin dynamics to adaptive inverse reinforcement learning for stochastic gradient algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hamilton-Jacobi Deep Q-Learning for Deterministic Continuous-Time Systems with Lipschitz Continuous Controls&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jeongho Kim, Jaeuk Shin, Insoon Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Hamilton-Jacobi deep Q-learning for deterministic continuous-time systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partial Policy Iteration for L1-Robust Markov Decision Processes&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Chin Pang Ho, Marek Petrik, Wolfram Wiesemann&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces partial policy iteration for L1-robust Markov decision processes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gaussian Approximation for Bias Reduction in Q-Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Carlo D&amp;rsquo;Eramo, Andrea Cini, Alessandro Nuara, Matteo Pirotta, Cesare Alippi, Jan Peters, Marcello Restelli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes Gaussian approximation techniques for bias reduction in Q-learning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="other-optimization-topics"&gt;
 Other Optimization Topics&lt;span class="heading__anchor"&gt; &lt;a href="#other-optimization-topics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers covering miscellaneous optimization topics, including Newton methods, SVM training, and eigenvector computation.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global and Quadratic Convergence of Newton Hard-Thresholding Pursuit&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Shenglong Zhou, Naihua Xiu, Hou-Duo Qi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes global and quadratic convergence of Newton hard-thresholding pursuit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Two-Level Decomposition Framework Exploiting First and Second Order Information for SVM Training Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Giulio Galvan, Matteo Lapucci, Chih-Jen Lin, Marco Sciandrone&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a two-level decomposition framework for SVM training using first and second-order information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Approximate Newton Methods&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Haishan Ye, Luo Luo, Zhihua Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops approximate Newton methods for optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Guodong Zhang, Xuchan Bao, Laurent Lessard, Roger Grosse&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides a unified analysis of first-order methods for smooth games using integral quadratic constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;LassoNet: A Neural Network with Feature Sparsity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ismael Lemhadri, Feng Ruan, Louis Abraham, Robert Tibshirani&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces LassoNet, a neural network architecture promoting feature sparsity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An Algorithmic View of L2 Regularization and Some Path-Following Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yunzhang Zhu, Renxiong Liu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Explores L2 regularization from an algorithmic perspective with path-following algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Ensmallen Library for Flexible Numerical Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Ryan R. Curtin, Marcus Edel, Rahul Ganesh Prabhu, Suryoday Basak, Zhihao Lou, Conrad Sanderson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces the ensmallen library for flexible numerical optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Black-Box Reductions for Zeroth-Order Gradient Algorithms to Achieve Lower Query Complexity&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Bin Gu, Xiyuan Wei, Shangqian Gao, Ziran Xiong, Cheng Deng, Heng Huang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes black-box reductions for zeroth-order gradient algorithms to reduce query complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Riemannian Search for Eigenvector Computation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhiqiang Xu, Ping Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops Riemannian search methods for eigenvector computation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Optimization Research Papers in JMLR Volume 21</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v21/</link><pubDate>Tue, 29 Sep 2020 00:00:00 +0000</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/jmlr-v21/</guid><description>&lt;h1 class="heading" id="optimization-research-papers-in-jmlr-volume-21-2020"&gt;
 Optimization Research Papers in JMLR Volume 21 (2020)&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-research-papers-in-jmlr-volume-21-2020"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;p&gt;This document lists papers from JMLR Volume 21 (2020) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.&lt;/p&gt;
&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing convex optimization problems, including complexity bounds, convergence analysis, and applications in regression and assortment optimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Low Complexity Algorithm with O(√T) Regret and O(1) Constraint Violations for Online Convex Optimization with Long Term Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Hao Yu, Michael J. Neely&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a low-complexity algorithm for online convex optimization with long-term constraints, achieving O(√T) regret and O(1) constraint violations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Bounds for Parallel and Randomized Convex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Jelena Diakonikolas, Cristóbal Guzmán&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Establishes lower complexity bounds for parallel and randomized algorithms in convex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Discerning the Linear Convergence of ADMM for Structured Convex Optimization through the Lens of Variational Analysis&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xiaoming Yuan, Shangzhi Zeng, Jin Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the linear convergence of ADMM for structured convex optimization using variational analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Qihang Lin, Selvaprabu Nadarajah, Negar Soheili, Tianbao Yang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a data-efficient level set method for stochastic convex optimization with expectation constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conic Optimization for Quadratic Regression Under Sparse Noise&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Igor Molybog, Ramtin Madani, Javad Lavaei&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies conic optimization to quadratic regression under sparse noise conditions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Assortment Optimization with Changing Contextual Information&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xi Chen, Yining Wang, Yuan Zhou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Addresses dynamic assortment optimization with changing contextual information using convex optimization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convex Programming for Estimation in Nonlinear Recurrent Models&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Sohail Bahmani, Justin Romberg&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Uses convex programming for parameter estimation in nonlinear recurrent models.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="nonconvex-optimization"&gt;
 Nonconvex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#nonconvex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers tackling nonconvex optimization, focusing on guarantees for local minima, variance reduction, and algorithmic advancements.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Exact Guarantees on the Absence of Spurious Local Minima for Non-negative Rank-1 Robust Principal Component Analysis&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Salar Fattahi, Somayeh Sojoudi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides exact guarantees for the absence of spurious local minima in non-negative rank-1 robust PCA.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Nested Variance Reduction for Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dongruo Zhou, Pan Xu, Quanquan Gu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces a stochastic nested variance reduction method for nonconvex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, Quoc Tran-Dinh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes ProxSARAH, an efficient framework for stochastic composite nonconvex optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convergence Rates for the Stochastic Gradient Descent Method for Non-Convex Objective Functions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Benjamin Fehrman, Benjamin Gess, Arnulf Jentzen&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence rates of stochastic gradient descent for nonconvex objective functions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rachel Ward, Xiaoxia Wu, Leon Bottou&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proves sharp convergence rates for AdaGrad-Norm, the scalar-stepsize variant of AdaGrad, on nonconvex landscapes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Sparse Semismooth Newton Based Proximal Majorization-Minimization Algorithm for Nonconvex Square-Root-Loss Regression Problems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Peipei Tang, Chengjing Wang, Defeng Sun, Kim-Chuan Toh&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops a sparse semismooth Newton-based proximal majorization-minimization algorithm for nonconvex square-root-loss regression.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="stochastic-optimization"&gt;
 Stochastic Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on stochastic optimization methods, including gradient descent, variance reduction, and robustness to noise.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Convergences of Regularized Algorithms and Stochastic Gradient Methods with Random Projections&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Junhong Lin, Volkan Cevher&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence of regularized algorithms and stochastic gradient methods with random projections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dominic Richards, Patrick Rebeschini&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies graph-dependent implicit regularization in distributed stochastic subgradient descent.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a robust asynchronous stochastic gradient-push method with asymptotically optimal performance for strongly convex functions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xi Chen, Simon S. Du, Xin T. Tong&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Investigates stationary-point hitting time and ergodicity in stochastic gradient Langevin dynamics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aryan Mokhtari, Hamed Hassani, Amin Karbasi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Extends stochastic conditional gradient methods from convex minimization to submodular maximization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Aryan Mokhtari, Alec Koppel, Martin Takac, Alejandro Ribeiro&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces parallel doubly stochastic algorithms for large-scale learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yao Ma, Alex Olshevsky, Csaba Szepesvari, Venkatesh Saligrama&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies gradient descent to sparse rank-one matrix completion for crowd-sourced worker aggregation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Junhong Lin, Volkan Cevher&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Establishes optimal convergence rates for distributed learning using stochastic gradient methods and spectral algorithms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Andrei Kulunchakov, Julien Mairal&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops estimate sequences for stochastic composite optimization with variance reduction and noise robustness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Unified q-Memorization Framework for Asynchronous Stochastic Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Bin Gu, Wenhan Xian, Zhouyuan Huo, Cheng Deng, Heng Huang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a unified q-memorization framework for asynchronous stochastic optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yazhen Wang, Shang Wu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes gradient descent algorithms using stochastic differential equations in statistical and computational settings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Error-Feedback Framework: SGD with Delayed Gradients&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Sebastian U. Stich, Sai Praneeth Karimireddy&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces an error-feedback framework for stochastic gradient descent with delayed gradients.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="distributedparallel-optimization"&gt;
 Distributed/Parallel Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#distributedparallel-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing distributed or parallel optimization algorithms, focusing on communication efficiency and scalability.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Complexity Analysis of the Primal Solutions for the Accelerated Randomized Dual Coordinate Ascent&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Huan Li, Zhouchen Lin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the iteration complexity of recovering primal solutions in accelerated randomized dual coordinate ascent.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;WONDER: Weighted One-shot Distributed Ridge Regression in High Dimensions&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Edgar Dobriban, Yue Sheng&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes WONDER, a weighted one-shot distributed ridge regression method for high-dimensional data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Anis Elgabli, Jihong Park, Amrit S. Bedi, Mehdi Bennis, Vaneet Aggarwal&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces GADMM, a fast and communication-efficient framework for distributed machine learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Boyue Li, Shicong Cen, Yuxin Chen, Yuejie Chi&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops communication-efficient distributed optimization with gradient tracking and variance reduction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On Convergence of Distributed Approximate Newton Methods: Globalization, Sharper Bounds and Beyond&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Xiao-Tong Yuan, Ping Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes convergence of distributed approximate Newton methods with sharper bounds and globalization techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="submodular-optimization"&gt;
 Submodular Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#submodular-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on submodular optimization, including minimization and maximization problems.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quadratic Decomposable Submodular Function Minimization: Theory and Practice&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Pan Li, Niao He, Olgica Milenkovic&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies quadratic decomposable submodular function minimization with theoretical and practical insights.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rad Niazadeh, Tim Roughgarden, Joshua R. Wang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops optimal algorithms for continuous non-monotone submodular and DR-submodular maximization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="bayesian-and-hyperparameter-optimization"&gt;
 Bayesian and Hyperparameter Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#bayesian-and-hyperparameter-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers addressing Bayesian optimization and hyperparameter tuning for scalable and robust optimization.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R. Collins, Jeff Schneider, Barnabas Poczos, Eric P. Xing&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces Dragonfly, a scalable and robust Bayesian optimization framework for hyperparameter tuning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributionally Ambiguous Optimization for Batch Bayesian Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Nikitas Rontsis, Michael A. Osborne, Paul J. Goulart&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes distributionally ambiguous optimization for batch Bayesian optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Kalai-Smorodinsky Solution for Many-Objective Bayesian Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mickael Binois, Victor Picheny, Patrick Taillandier, Abderrahmane Habbal&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies the Kalai-Smorodinsky solution to many-objective Bayesian optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robust Reinforcement Learning with Bayesian Optimisation and Quadrature&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Supratik Paul, Konstantinos Chatzilygeroudis, Kamil Ciosek, Jean-Baptiste Mouret, Michael A. Osborne, Shimon Whiteson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Integrates Bayesian optimization and quadrature for robust reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="optimization-in-reinforcement-learning"&gt;
 Optimization in Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#optimization-in-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers focusing on optimization techniques for policy optimization and reinforcement learning.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, Martin J. Wainwright&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Establishes convergence guarantees for derivative-free policy optimization methods in linear quadratic systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Expected Policy Gradients for Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Kamil Ciosek, Shimon Whiteson&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces expected policy gradients (EPG), which unify stochastic and deterministic policy gradients for reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Importance Sampling Techniques for Policy Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Alberto Maria Metelli, Matteo Papini, Nico Montali, Marcello Restelli&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes importance sampling techniques for efficient policy optimization in reinforcement learning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 class="heading" id="other-optimization-topics"&gt;
 Other Optimization Topics&lt;span class="heading__anchor"&gt; &lt;a href="#other-optimization-topics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Papers covering miscellaneous optimization topics, including dictionary learning, neural network verification, and differential privacy.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learning with Fenchel-Young Losses&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Mathieu Blondel, André F.T. Martins, Vlad Niculae&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces Fenchel-Young losses, a generic way to construct convex loss functions from a regularization function, with applications to structured prediction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Branch and Bound for Piecewise Linear Neural Network Verification&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Philip H.S. Torr, Pushmeet Kohli, M. Pawan Kumar&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies branch and bound techniques for piecewise linear neural network verification.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Conjugate Gradients for Kernel Machines&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Simon Bartels, Philipp Hennig&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops conjugate gradient methods for optimization in kernel machines.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unique Sharp Local Minimum in L1-Minimization Complete Dictionary Learning&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yu Wang, Siqi Wu, Bin Yu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes unique sharp local minima in L1-minimization for complete dictionary learning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Community-Based Group Graphical Lasso&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Eugen Pircalabelu, Gerda Claeskens&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a community-based group graphical Lasso for estimating graphical models with community structure.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Toby Dylan Hocking, Guillem Rigaill, Paul Fearnhead, Guillaume Bourque&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops constrained dynamic programming and supervised penalty learning for peak detection in genomic data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Loss Control with Rank-One Covariance Estimate for Short-Term Portfolio Optimization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Zhao-Rong Lai, Liming Tan, Xiaotian Wu, Liangda Fang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies rank-one covariance estimation for loss control in short-term portfolio optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A Unified Framework of Online Learning Algorithms for Training Recurrent Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Owen Marschall, Kyunghyun Cho, Cristina Savin&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a unified framework of online learning algorithms for training recurrent neural networks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Networks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Amir R. Asadi, Emmanuel Abbe&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Introduces multilevel entropic regularization for neural network training, combining the chaining technique with the chain rule.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nesterov&amp;rsquo;s Acceleration for Approximate Newton&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Haishan Ye, Luo Luo, Zhihua Zhang&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Applies Nesterov’s acceleration to approximate Newton methods for optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New Insights and Perspectives on the Natural Gradient Method&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: James Martens&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Provides new insights into the natural gradient method for optimization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Complete Dictionary Learning via L4-Norm Maximization over the Orthogonal Group&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Yuexiang Zhai, Zitong Yang, Zhenyu Liao, John Wright, Yi Ma&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops complete dictionary learning via L4-norm maximization over the orthogonal group.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Empirical Risk Minimization in the Non-Interactive Local Model of Differential Privacy&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Di Wang, Marco Gaboardi, Adam Smith, Jinhui Xu&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Studies empirical risk minimization in the non-interactive local model of differential privacy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stable Regression: On the Power of Optimization over Randomization&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dimitris Bertsimas, Ivan Paskov&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Analyzes the power of optimization over randomization in stable regression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast Exact Matrix Completion: A Unified Optimization Framework for Matrix Completion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Dimitris Bertsimas, Michael Lingzhi Li&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Proposes a unified optimization framework for fast exact matrix completion.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rank-Based Lasso - Efficient Methods for High-Dimensional Robust Model Selection&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Authors&lt;/em&gt;: Wojciech Rejchel, Małgorzata Bogdan&lt;br&gt;
&lt;em&gt;Description&lt;/em&gt;: Develops rank-based Lasso methods for high-dimensional robust model selection.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</description></item></channel></rss>