<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning on Nam Le</title><link>https://blog.namln.org/en/tags/machine-learning/</link><description>Recent content in Machine Learning on Nam Le</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Thu, 02 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.namln.org/en/tags/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Mathematics - Optimization</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/</link><pubDate>Thu, 27 Jun 2024 23:14:15 +0800</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/</guid><description>&lt;h1 class="heading" id="branches-of-optimization-research"&gt;
 Branches of Optimization Research&lt;span class="heading__anchor"&gt; &lt;a href="#branches-of-optimization-research"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;&lt;h2 class="heading" id="convex-optimization"&gt;
 Convex Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#convex-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Convex optimization focuses on problems where the objective function and constraints are convex, so that every local optimum is also a global optimum. This field is foundational in machine learning, signal processing, and control systems because convex problems admit efficient algorithms with convergence guarantees.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Convex Optimization&lt;/em&gt; by Boyd and Vandenberghe - &lt;a href="https://web.stanford.edu/~boyd/cvxbook/"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Convex Optimization Theory&lt;/em&gt; by Dimitri P. Bertsekas - &lt;a href="https://web.mit.edu/dimitrib/www/Convex_Theory_Entire_Book.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="discrete-combinatorial-and-integer-optimization"&gt;
 Discrete, Combinatorial, and Integer Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#discrete-combinatorial-and-integer-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;This branch deals with optimization problems involving discrete variables, such as integers or combinatorial structures, often encountered in scheduling, network design, and logistics. Bayesian optimization, grouped here although it is not restricted to discrete problems, is particularly useful for optimizing expensive black-box functions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Bayesian Optimization in Action&lt;/em&gt; by Quan Nguyen - &lt;a href="https://www.amazon.com/Bayesian-Optimization-Action-Quan-Nguyen/dp/1633439070"&gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Experimentation for Engineers&lt;/em&gt; by David Sweet - &lt;a href="https://www.amazon.com/Tuning-Up-testing-Bayesian-optimization/dp/1617298158"&gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="operations-research"&gt;
 Operations Research&lt;span class="heading__anchor"&gt; &lt;a href="#operations-research"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Operations research applies mathematical modeling and optimization to complex decision-making in logistics, supply chain, and resource allocation. It integrates techniques like linear programming, simulation, and heuristic methods to optimize real-world systems.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Operations Research: An Introduction&lt;/em&gt; by Hamdy A. Taha - &lt;a href="https://www.pearson.com/en-us/subject-catalog/p/operations-research-an-introduction/P200000003221"&gt;Pearson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Introduction to Operations Research&lt;/em&gt; by Frederick Hillier and Gerald Lieberman - &lt;a href="https://www.mheducation.com/highered/product/introduction-operations-research-hillier-lieberman/M9781259872990.html"&gt;McGraw Hill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Julia Programming for Operations Research&lt;/em&gt; by Changhyun Kwon - &lt;a href="https://juliabook.chkwon.net/book"&gt;PDF&lt;/a&gt; - &lt;a href="https://github.com/chkwon/jpor_codes"&gt;code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Mathematical Programming and Operations Research: Modeling, Algorithms, and Complexity. Examples in Python and Julia&lt;/em&gt;. Edited by Robert Hildebrand - &lt;a href="https://github.com/open-optimization/open-optimization-or-book/blob/master/MathematicalProgrammingandOperationsResearch.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A First Course in Linear Optimization&lt;/em&gt; by Jon Lee - &lt;a href="https://www.solvermax.com/downloads/lee-linearoptimization4.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Decomposition Techniques in Mathematical Programming&lt;/em&gt; by Conejo, Castillo, Mínguez, and García-Bertrand - &lt;a href="https://link.springer.com/book/10.1007/3-540-27686-6"&gt;Springer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Algorithms for Optimization&lt;/em&gt; by Mykel J. Kochenderfer and Tim A. Wheeler - &lt;a href="https://algorithmsbook.com/optimization/files/optimization.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Model Building in Mathematical Programming&lt;/em&gt; - Introductory modeling book by H. Paul Williams - &lt;a href="https://www.wiley.com/en-ie/Model&amp;#43;Building&amp;#43;in&amp;#43;Mathematical&amp;#43;Programming,&amp;#43;5th&amp;#43;Edition-p-9781118443330"&gt;Wiley&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="meta-heuristics"&gt;
 Meta-heuristics&lt;span class="heading__anchor"&gt; &lt;a href="#meta-heuristics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Meta-heuristics are high-level strategies for solving complex optimization problems where exact methods are computationally infeasible. They include nature-inspired algorithms like genetic algorithms and simulated annealing, widely used in engineering and data science.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Metaheuristics&lt;/em&gt; by Patrick Siarry - &lt;a href="https://link.springer.com/book/10.1007/978-3-319-45403-0"&gt;Springer (open access)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Essentials of Metaheuristics&lt;/em&gt; by Sean Luke - &lt;a href="https://cs.gmu.edu/~sean/book/metaheuristics/"&gt;link&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Handbook of Metaheuristics&lt;/em&gt; by Michel Gendreau and Jean-Yves Potvin - &lt;a href="https://link.springer.com/book/10.1007/978-1-4419-1665-5"&gt;Springer (open access)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;An Introduction to Metaheuristics for Optimization&lt;/em&gt; by Bastien Chopard and Marco Tomassini - &lt;a href="https://link.springer.com/book/10.1007/978-3-319-93073-2"&gt;Springer (open access)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Metaheuristic and Evolutionary Computation: Algorithms and Applications&lt;/em&gt; by Hasmat Malik, Atif Iqbal, Puneet Joshi, Sanjay Agrawal, and Farhad Ilahi Bakhsh - &lt;a href="https://link.springer.com/book/10.1007/978-981-15-7571-6"&gt;Springer (open access)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clever Algorithms: Nature-Inspired Programming Recipes&lt;/em&gt; by Jason Brownlee - &lt;a href="https://github.com/clever-algorithms/CleverAlgorithms"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Metaheuristics: From Design to Implementation&lt;/em&gt; by El-Ghazali Talbi - &lt;a href="https://www.wiley.com/en-us/Metaheuristics%3A&amp;#43;From&amp;#43;Design&amp;#43;to&amp;#43;Implementation&amp;#43;-p-9780470278581"&gt;Wiley&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="dynamic-programming-and-reinforcement-learning"&gt;
 Dynamic Programming and Reinforcement Learning&lt;span class="heading__anchor"&gt; &lt;a href="#dynamic-programming-and-reinforcement-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Dynamic programming and reinforcement learning address sequential decision-making problems, breaking them into subproblems or learning optimal policies through interaction with environments. These methods are critical in robotics, finance, and AI.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Various titles on &lt;em&gt;Dynamic Programming, Optimal Control and Reinforcement Learning&lt;/em&gt; by Dimitri Bertsekas - &lt;a href="http://www.athenasc.com/index.html"&gt;List&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Reinforcement Learning: An Introduction (2nd Edition)&lt;/em&gt; by Richard Sutton and Andrew Barto - &lt;a href="http://incompleteideas.net/book/RLbook2020.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Decision Making Under Uncertainty: Theory and Application&lt;/em&gt; by Mykel J. Kochenderfer - &lt;a href="https://web.stanford.edu/group/sisl/public/dmu.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Algorithms for Decision Making&lt;/em&gt; by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray - &lt;a href="https://algorithmsbook.com/files/dm.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
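&lt;p&gt;To make the "breaking them into subproblems" idea concrete, here is a minimal value-iteration sketch. The four-state chain, its reward, and the discount factor are hypothetical choices made only for illustration; they are not taken from any of the books above.&lt;/p&gt;

```python
# Hypothetical 4-state chain: move left or right; entering state 3 pays 1 and ends.
N_STATES = 4
GAMMA = 0.9
ACTIONS = (-1, +1)


def step(state, action):
    """Deterministic transition; returns (next_state, reward)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def value_iteration(sweeps=100):
    """Repeatedly apply the Bellman optimality backup to every state."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        new_values = []
        for s in range(N_STATES):
            if s == N_STATES - 1:
                new_values.append(0.0)  # terminal state has no future reward
                continue
            # Best one-step reward plus discounted value of the successor state
            best = max(r + GAMMA * values[nxt]
                       for nxt, r in (step(s, a) for a in ACTIONS))
            new_values.append(best)
        values = new_values
    return values


print(value_iteration())
```

&lt;p&gt;Each state's value is computed from the values of its successors, which is the subproblem structure that dynamic programming exploits. Reinforcement learning methods estimate the same quantities from sampled interaction when the transition model is unknown.&lt;/p&gt;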
&lt;h2 class="heading" id="constraint-programming"&gt;
 Constraint Programming&lt;span class="heading__anchor"&gt; &lt;a href="#constraint-programming"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Constraint programming solves problems by defining constraints that must be satisfied, often used in scheduling, planning, and configuration tasks. It excels in problems with complex logical constraints and discrete variables.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Handbook of Constraint Programming&lt;/em&gt; by Francesca Rossi, Peter van Beek and Toby Walsh - &lt;a href="https://www.amazon.com/dp/0444527265"&gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A Tutorial on Constraint Programming&lt;/em&gt; by Barbara M. Smith (University of Leeds) - &lt;a href="https://www.dcs.gla.ac.uk/~pat/cpM/papers/95_14.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="combinatorial-optimization"&gt;
 Combinatorial Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#combinatorial-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Combinatorial optimization focuses on finding optimal solutions in discrete structures, such as graphs or sets, often using algorithms for problems like the traveling salesman or graph coloring, with applications in logistics and network design.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Combinatorial Optimization: Algorithms and Complexity&lt;/em&gt; by Christos H. Papadimitriou and Kenneth Steiglitz - &lt;a href="https://www.amazon.com/Combinatorial-Optimization-Algorithms-Complexity-Computer-ebook/dp/B00C8UQZAO"&gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Combinatorial Optimization: Theory and Algorithms&lt;/em&gt; by Bernhard Korte and Jens Vygen - &lt;a href="https://link.springer.com/book/10.1007/978-3-662-56039-6"&gt;Springer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A First Course in Combinatorial Optimization&lt;/em&gt; by Jon Lee - &lt;a href="https://www.amazon.com/Combinatorial-Optimization-Cambridge-Applied-Mathematics/dp/0521010128"&gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="stochastic-optimization-and-control"&gt;
 Stochastic Optimization and Control&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-optimization-and-control"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Stochastic optimization handles problems with uncertainty or randomness, using probabilistic models to optimize objectives. It is widely applied in machine learning, finance, and operations research for robust decision-making.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Lectures on Stochastic Programming: Modeling and Theory&lt;/em&gt; (SIAM) by Shapiro, Dentcheva, and Ruszczynski - &lt;a href="https://bpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/4/1470/files/2021/03/SPbook.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Introductory Lectures on Stochastic Optimization&lt;/em&gt; by John C. Duchi - &lt;a href="https://web.stanford.edu/~jduchi/PCMIConvex/Duchi16.pdf"&gt;PDF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 class="heading" id="useful-resources"&gt;
 Useful Resources&lt;span class="heading__anchor"&gt; &lt;a href="#useful-resources"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Prof. Nguyen Mau Nam, &lt;a href="https://maunamn.wordpress.com/"&gt;Convex Analysis - An introduction to convexity and nonsmooth analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ben Recht, &lt;a href="https://www.argmin.net/"&gt;arg min&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Prof. Dimitri P. Bertsekas, &lt;a href="http://www.athenasc.com/convexity.html"&gt;Convex Analysis and Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Prof. Dimitri P. Bertsekas, &lt;a href="http://www.athenasc.com/nonlinbook.html"&gt;Nonlinear Programming: 3rd Edition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.offconvex.org/"&gt;Off the convex path&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 class="heading" id="post-on-optimization"&gt;
 Posts on Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#post-on-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h1&gt;</description></item><item><title>Second-order Stochastic Optimization methods for Machine Learning</title><link>https://blog.namln.org/en/mathematics/analysis/optimization/soms/</link><pubDate>Thu, 27 Jun 2024 23:14:15 +0800</pubDate><guid>https://blog.namln.org/en/mathematics/analysis/optimization/soms/</guid><description>&lt;h2 class="heading" id="analysis-of-the-hessian"&gt;
 Analysis of the Hessian&lt;span class="heading__anchor"&gt; &lt;a href="#analysis-of-the-hessian"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-empirical-analysis-of-the-hessian-of-over-parametrized-neural-networks"&gt;
 1. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks&lt;span class="heading__anchor"&gt; &lt;a href="#1-empirical-analysis-of-the-hessian-of-over-parametrized-neural-networks"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2017&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, Leon Bottou&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1706.04454&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1706.04454"&gt;https://arxiv.org/abs/1706.04454&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="2-the-full-spectrum-of-deepnet-hessians-at-scale-dynamics-with-sgd-training-and-sample-size"&gt;
 2. The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size&lt;span class="heading__anchor"&gt; &lt;a href="#2-the-full-spectrum-of-deepnet-hessians-at-scale-dynamics-with-sgd-training-and-sample-size"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2018&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Vardan Papyan&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1811.07062&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1811.07062"&gt;https://arxiv.org/abs/1811.07062&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We apply state-of-the-art tools in modern high-dimensional numerical linear algebra to approximate efficiently the spectrum of the Hessian of modern deepnets, with tens of millions of parameters, trained on real data. Our results corroborate previous findings, based on small-scale networks, that the Hessian exhibits &amp;ldquo;spiked&amp;rdquo; behavior, with several outliers isolated from a continuous bulk. We decompose the Hessian into different components and study the dynamics with training and sample size of each term individually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="3-pyhessian-neural-networks-through-the-lens-of-the-hessian"&gt;
 3. PyHessian: Neural Networks Through the Lens of the Hessian&lt;span class="heading__anchor"&gt; &lt;a href="#3-pyhessian-neural-networks-through-the-lens-of-the-hessian"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2019&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1912.07145&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1912.07145"&gt;https://arxiv.org/abs/1912.07145&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; Open source, as stated in the abstract. Known repository: &lt;a href="https://github.com/amirgholami/PyHessian"&gt;https://github.com/amirgholami/PyHessian&lt;/a&gt;&lt;/p&gt;
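&lt;p&gt;The quantities named in the abstract (top eigenvalues and the trace) can be estimated using only Hessian-vector products. The sketch below is not PyHessian's API; it is a plain-Python illustration in which a small hypothetical matrix &lt;code&gt;H&lt;/code&gt; stands in for a network Hessian and &lt;code&gt;hvp&lt;/code&gt; plays the role of an autodiff Hessian-vector product. It uses power iteration for the top eigenvalue and Hutchinson's estimator for the trace.&lt;/p&gt;

```python
import math
import random

# Hypothetical stand-in for a network Hessian: a small symmetric matrix.
# The routines below touch H only through hvp(), as a matrix-free method would.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 1.0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hvp(v):
    """Return the Hessian-vector product H @ v."""
    return [dot(row, v) for row in H]


def top_eigenvalue(hvp_fn, dim, iters=200, seed=0):
    """Power iteration: estimate the largest-magnitude eigenvalue."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    eig = 0.0
    for _ in range(iters):
        norm = math.sqrt(dot(v, v))
        v = [x / norm for x in v]
        hv = hvp_fn(v)
        eig = dot(v, hv)  # Rayleigh quotient for the unit vector v
        v = hv
    return eig


def hutchinson_trace(hvp_fn, dim, samples=2000, seed=0):
    """Hutchinson estimator: trace(H) is the mean of z^T H z over Rademacher z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        total += dot(z, hvp_fn(z))
    return total / samples


print(top_eigenvalue(hvp, 3))    # largest eigenvalue estimate
print(hutchinson_trace(hvp, 3))  # noisy estimate of the true trace, 8.0
```

&lt;p&gt;Neither routine ever forms the full matrix, which is why this style of analysis can scale to networks with many parameters.&lt;/p&gt;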
&lt;hr&gt;
&lt;h3 class="heading" id="4-a-deeper-look-at-the-hessian-eigenspectrum-of-deep-neural-networks-and-its-applications-to-regularization"&gt;
 4. A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization&lt;span class="heading__anchor"&gt; &lt;a href="#4-a-deeper-look-at-the-hessian-eigenspectrum-of-deep-neural-networks-and-its-applications-to-regularization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2020&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, Vineeth N Balasubramanian&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2012.03801&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2012.03801"&gt;https://arxiv.org/abs/2012.03801&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; Loss landscape analysis is extremely useful for a deeper understanding of the generalization ability of deep neural network models. In this work, we propose a layerwise loss landscape analysis where the loss surface at every layer is studied independently and also on how each correlates to the overall loss surface. We study the layerwise loss landscape by studying the eigenspectra of the Hessian at each layer. In particular, our results show that the layerwise Hessian geometry is largely similar to the entire Hessian. We also report an interesting phenomenon where the Hessian eigenspectrum of middle layers of the deep neural network are observed to be most similar to the overall Hessian eigenspectrum. We also show that the maximum eigenvalue and the trace of the Hessian (both full network and layerwise) reduce as training of the network progresses. We leverage on these observations to propose a new regularizer based on the trace of the layerwise Hessian. Penalizing the trace of the Hessian at every layer indirectly forces Stochastic Gradient Descent to converge to flatter minima, which are shown to have better generalization performance. In particular, we show that such a layerwise regularizer can be leveraged to penalize the middlemost layers alone, which yields promising results. Our empirical studies on well-known deep nets across datasets support the claims of this work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="diagonal-scaling"&gt;
 Diagonal Scaling&lt;span class="heading__anchor"&gt; &lt;a href="#diagonal-scaling"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-adahessian-an-adaptive-second-order-optimizer-for-machine-learning"&gt;
 1. AdaHessian: An Adaptive Second Order Optimizer for Machine Learning&lt;span class="heading__anchor"&gt; &lt;a href="#1-adahessian-an-adaptive-second-order-optimizer-for-machine-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2020&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2006.00719&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; AdaHessian&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2006.00719"&gt;https://arxiv.org/abs/2006.00719&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; Known repository: &lt;a href="https://github.com/amirgholami/adahessian"&gt;https://github.com/amirgholami/adahessian&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="2-sophia-a-scalable-stochastic-second-order-optimizer-for-language-model-pre-training"&gt;
 2. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training&lt;span class="heading__anchor"&gt; &lt;a href="#2-sophia-a-scalable-stochastic-second-order-optimizer-for-language-model-pre-training"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2023&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2305.14342&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; Sophia&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2305.14342"&gt;https://arxiv.org/abs/2305.14342&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; Known repository: &lt;a href="https://github.com/Liuhong99/Sophia"&gt;https://github.com/Liuhong99/Sophia&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="hessian-free-optimization"&gt;
 Hessian-free Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#hessian-free-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-learning-recurrent-neural-networks-with-hessian-free-optimization"&gt;
 1. Learning Recurrent Neural Networks with Hessian-Free Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#1-learning-recurrent-neural-networks-with-hessian-free-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2011&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; James Martens, Ilya Sutskever&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://www.cs.toronto.edu/~jmartens/docs/RNN_HF.pdf"&gt;https://www.cs.toronto.edu/~jmartens/docs/RNN_HF.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, on three natural and highly complex real-world sequence datasets where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="2-training-neural-networks-with-stochastic-hessian-free-optimization"&gt;
 2. Training Neural Networks with Stochastic Hessian-Free Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#2-training-neural-networks-with-stochastic-hessian-free-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2013&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Ryan Kiros&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1301.3641&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; SHF&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1301.3641"&gt;https://arxiv.org/abs/1301.3641&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens&amp;rsquo; HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
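&lt;p&gt;The abstract notes that HF builds update directions with conjugate gradient using only curvature-vector products. A minimal sketch of that inner loop is shown below. It assumes a symmetric positive definite curvature matrix and leaves out the damping, mini-batching, and dropout discussed in the paper; the matrix &lt;code&gt;H&lt;/code&gt; and gradient &lt;code&gt;g&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical symmetric positive definite curvature matrix and gradient.
H = [[4.0, 1.0],
     [1.0, 3.0]]
g = [1.0, 2.0]


def hvp(v):
    """Curvature-vector product H @ v; the only access to H that CG needs."""
    return [dot(row, v) for row in H]


def conjugate_gradient(hvp_fn, b, iters=50, tol=1e-20):
    """Solve H x = b using only products H @ v."""
    x = [0.0] * len(b)
    r = list(b)   # residual b - H x for the starting point x = 0
    p = list(r)   # first search direction
    rs = dot(r, r)
    for _ in range(iters):
        if tol >= rs:
            break
        hp = hvp_fn(p)
        alpha = rs / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones with respect to H
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Newton-like direction d solving H d = -g
direction = conjugate_gradient(hvp, [-gi for gi in g])
print(direction)
```

&lt;p&gt;In practice the loop is usually stopped after a small number of iterations, so the result is an approximate rather than exact solve.&lt;/p&gt;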
&lt;hr&gt;
&lt;h2 class="heading" id="quasi-newton"&gt;
 Quasi-Newton&lt;span class="heading__anchor"&gt; &lt;a href="#quasi-newton"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-a-stochastic-quasi-newton-method-for-large-scale-optimization"&gt;
 1. A Stochastic Quasi-Newton Method for Large-Scale Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#1-a-stochastic-quasi-newton-method-for-large-scale-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2014&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; R.H. Byrd, S.L. Hansen, J. Nocedal, Y. Singer&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1401.7020&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1401.7020"&gt;https://arxiv.org/abs/1401.7020&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="2-a-multi-batch-l-bfgs-method-for-machine-learning"&gt;
 2. A Multi-Batch L-BFGS Method for Machine Learning&lt;span class="heading__anchor"&gt; &lt;a href="#2-a-multi-batch-l-bfgs-method-for-machine-learning"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2016&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Albert S. Berahas, Jorge Nocedal, Martin Takáč&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1605.06049&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1605.06049"&gt;https://arxiv.org/abs/1605.06049&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="3-stochastic-quasi-newton-with-line-search-regularization"&gt;
 3. Stochastic Quasi-Newton with Line-Search Regularization&lt;span class="heading__anchor"&gt; &lt;a href="#3-stochastic-quasi-newton-with-line-search-regularization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2019&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Adrian Wills, Thomas Schön&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1909.01238&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; SQN&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1909.01238"&gt;https://arxiv.org/abs/1909.01238&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; In this paper we present a novel quasi-Newton algorithm for use in stochastic optimisation. Quasi-Newton methods have had an enormous impact on deterministic optimisation problems because they afford rapid convergence and computationally attractive algorithms. In essence, this is achieved by learning the second-order (Hessian) information based on observing first-order gradients. We extend these ideas to the stochastic setting by employing a highly flexible model for the Hessian and infer its value based on observing noisy gradients. In addition, we propose a stochastic counterpart to standard line-search procedures and demonstrate the utility of this combination on maximum likelihood identification for general nonlinear state space models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="4-practical-quasi-newton-methods-for-training-deep-neural-networks"&gt;
 4. Practical Quasi-Newton Methods for Training Deep Neural Networks&lt;span class="heading__anchor"&gt; &lt;a href="#4-practical-quasi-newton-methods-for-training-deep-neural-networks"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2020&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Donald Goldfarb, Yi Ren, Achraf Bahamou&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2006.08877&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2006.08877"&gt;https://arxiv.org/abs/2006.08877&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="gauss-newton"&gt;
 Gauss-Newton&lt;span class="heading__anchor"&gt; &lt;a href="#gauss-newton"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-efficient-subsampled-gauss-newton-and-natural-gradient-methods-for-training-neural-networks"&gt;
 1. Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks&lt;span class="heading__anchor"&gt; &lt;a href="#1-efficient-subsampled-gauss-newton-and-natural-gradient-methods-for-training-neural-networks"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2019&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Yi Ren, Donald Goldfarb&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1906.02353&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; SWM-GN, SWM-NG&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1906.02353"&gt;https://arxiv.org/abs/1906.02353&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for solving non-convex optimization problems that arise in training deep neural networks involving enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices and either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic), which, in the latter case, we prove convergent to a stationary point. By using the Sherman-Morrison-Woodbury formula with automatic differentiation (backpropagation) we show how our methods can be implemented to perform efficiently. Finally, numerical results are presented to demonstrate the effectiveness of our proposed methods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="2-on-the-promise-of-the-stochastic-generalized-gauss-newton-method-for-training-dnns"&gt;
 2. On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs&lt;span class="heading__anchor"&gt; &lt;a href="#2-on-the-promise-of-the-stochastic-generalized-gauss-newton-method-for-training-dnns"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2020&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Matilde Gargiani, et al.&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2006.02409&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; SGN&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2006.02409"&gt;https://arxiv.org/abs/2006.02409&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to compute an approximate search direction, relies on the conjugate gradient method combined with forward and reverse automatic differentiation. Despite the success of SGD and its first-order variants, and despite Hessian-free methods based on the Gauss-Newton Hessian approximation having been already theoretically proposed as practical methods for training DNNs, we believe that SGN has a lot of undiscovered and yet not fully displayed potential in big mini-batch scenarios. For this setting, we demonstrate that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime. This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform, which, unlike Tensorflow and Pytorch, supports forward automatic differentiation. This enables researchers to further study and improve this promising optimization technique and hopefully reconsider stochastic second-order methods as competitive optimization techniques for training DNNs; we also hope that the promise of SGN may lead to forward automatic differentiation being added to Tensorflow or Pytorch. Our results also show that in big mini-batch scenarios SGN is more robust than SGD with respect to its hyperparameters (we never had to tune its step-size for our benchmarks!), which eases the expensive process of hyperparameter tuning that is instead crucial for the performance of first-order methods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; Mentions &amp;lsquo;implementation&amp;rsquo; in abstract&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="3-stochastic-gauss-newton-algorithms-for-nonconvex-compositional-optimization"&gt;
 3. Stochastic Gauss-Newton Algorithms for Nonconvex Compositional Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#3-stochastic-gauss-newton-algorithms-for-nonconvex-compositional-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2020&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Quoc Tran-Dinh, et al.&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2002.07290&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; SGN with SARAH estimators&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2002.07290"&gt;https://arxiv.org/abs/2002.07290&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We develop two new stochastic Gauss-Newton algorithms for solving a class of non-convex stochastic compositional optimization problems frequently arising in practice. We consider both the expectation and finite-sum settings under standard assumptions, and use both classical stochastic and SARAH estimators for approximating function values and Jacobians. In the expectation case, we establish $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity to achieve a stationary point in expectation and estimate the total number of stochastic oracle calls for both function value and its Jacobian, where $\varepsilon$ is a desired accuracy. In the finite sum case, we also estimate $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity and the total oracle calls with high probability. To our best knowledge, this is the first time such global stochastic oracle complexity is established for stochastic Gauss-Newton methods. Finally, we illustrate our theoretical results via two numerical examples on both synthetic and real datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="4-nonlinear-least-squares-for-large-scale-machine-learning-using-stochastic-jacobian-estimates"&gt;
 4. Nonlinear Least Squares for Large-Scale Machine Learning using Stochastic Jacobian Estimates&lt;span class="heading__anchor"&gt; &lt;a href="#4-nonlinear-least-squares-for-large-scale-machine-learning-using-stochastic-jacobian-estimates"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2021&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Johannes J. Brust&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2107.05598&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; NLLS1, NLLSL&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2107.05598"&gt;https://arxiv.org/abs/2107.05598&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; For large nonlinear least squares loss functions in machine learning we exploit the property that the number of model parameters typically exceeds the data in one batch. This implies a low-rank structure in the Hessian of the loss, which enables effective means to compute search directions. Using this property, we develop two algorithms that estimate Jacobian matrices and perform well when compared to state-of-the-art methods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="5-improving-levenberg-marquardt-algorithm-for-neural-networks"&gt;
 5. Improving Levenberg-Marquardt Algorithm for Neural Networks&lt;span class="heading__anchor"&gt; &lt;a href="#5-improving-levenberg-marquardt-algorithm-for-neural-networks"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2022&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Omead Pooladzandi, Yiming Zhou&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2212.08769&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; LM&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2212.08769"&gt;https://arxiv.org/abs/2212.08769&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We explore the usage of the Levenberg-Marquardt (LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="6-rethinking-gauss-newton-for-learning-over-parameterized-models"&gt;
 6. Rethinking Gauss-Newton for learning over-parameterized models&lt;span class="heading__anchor"&gt; &lt;a href="#6-rethinking-gauss-newton-for-learning-over-parameterized-models"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2023&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Michael Arbel, et al.&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2302.02904&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2302.02904"&gt;https://arxiv.org/abs/2302.02904&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; This work studies the global convergence and implicit bias of Gauss Newton&amp;rsquo;s (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN&amp;rsquo;s method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h3 class="heading" id="7-exact-gauss-newton-optimization-for-training-deep-neural-networks"&gt;
 7. Exact Gauss-Newton Optimization for Training Deep Neural Networks&lt;span class="heading__anchor"&gt; &lt;a href="#7-exact-gauss-newton-optimization-for-training-deep-neural-networks"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2024&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2405.14402&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; EGN&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2405.14402"&gt;https://arxiv.org/abs/2405.14402&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="fisher-information"&gt;
 Fisher Information&lt;span class="heading__anchor"&gt; &lt;a href="#fisher-information"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-optimizing-neural-networks-with-kronecker-factored-approximate-curvature"&gt;
 1. Optimizing Neural Networks with Kronecker-factored Approximate Curvature&lt;span class="heading__anchor"&gt; &lt;a href="#1-optimizing-neural-networks-with-kronecker-factored-approximate-curvature"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2015&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; James Martens, Roger Grosse&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:1503.05671&lt;br&gt;
&lt;strong&gt;Algorithm:&lt;/strong&gt; K-FAC&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/1503.05671"&gt;https://arxiv.org/abs/1503.05671&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network&amp;rsquo;s Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC&amp;rsquo;s approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; Known repository: Various implementations available&lt;/p&gt;
&lt;hr&gt;
&lt;h2 class="heading" id="other"&gt;
 Other&lt;span class="heading__anchor"&gt; &lt;a href="#other"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;h3 class="heading" id="1-second-order-optimization-with-lazy-hessians"&gt;
 1. Second-order optimization with lazy Hessians&lt;span class="heading__anchor"&gt; &lt;a href="#1-second-order-optimization-with-lazy-hessians"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Year:&lt;/strong&gt; 2022&lt;br&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Nikita Doikov, El Mahdi Chayti, Martin Jaggi&lt;br&gt;
&lt;strong&gt;ArXiv ID:&lt;/strong&gt; arXiv:2212.00781&lt;br&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2212.00781"&gt;https://arxiv.org/abs/2212.00781&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; We analyze Newton&amp;rsquo;s method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every $d$ iterations, where $d$ is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; No explicit source code information found&lt;/p&gt;
&lt;hr&gt;</description></item><item><title>Machine Learning &amp; Combinatorial Optimization</title><link>https://blog.namln.org/research/ml-co/</link><pubDate>Sat, 08 Apr 2023 00:00:00 +0000</pubDate><guid>https://blog.namln.org/research/ml-co/</guid><description>&lt;p&gt;A comprehensive overview of machine learning approaches and techniques applied to combinatorial optimization problems, covering foundational concepts, methodologies, and state-of-the-art advances.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scope&lt;/strong&gt;: Systematic review of learning-based CO solving methods including supervised learning for heuristics, reinforcement learning for search policies, and hybrid approaches combining classical and neural methods.&lt;/p&gt;
&lt;h2 class="heading" id="graph-matching"&gt;
 Graph Matching&lt;span class="heading__anchor"&gt; &lt;a href="#graph-matching"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;The problem of finding correspondences between vertices in two graphs, with applications in pattern recognition, shape analysis, and image matching. Deep learning methods have enabled scalable solutions for large graphs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, find a correspondence $\pi: V_1 \to V_2$ that maximizes structural similarity, typically measured by the number of preserved edges, or that minimizes a matching cost.&lt;/p&gt;
&lt;h2 class="heading" id="quadratic-assignment-problem"&gt;
 Quadratic Assignment Problem&lt;span class="heading__anchor"&gt; &lt;a href="#quadratic-assignment-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;An NP-hard optimization problem that assigns n facilities to n locations to minimize total cost, where costs depend on pairwise assignments. Classical applications include facility layout and keyboard design.&lt;/p&gt;
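&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} d_{\pi(i)\pi(j)}$ over permutations $\pi$, where $\pi(i)$ is the location assigned to facility $i$, $f_{ij}$ is the flow between facilities $i$ and $j$, and $d_{kl}$ is the distance between locations $k$ and $l$. The permutation enforces the assignment constraints: each facility is assigned to exactly one location and each location receives exactly one facility.&lt;/p&gt;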
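&lt;p&gt;A minimal sketch of how a candidate assignment can be scored, assuming the standard Koopmans-Beckmann form of the problem with a flow matrix between facilities and a distance matrix between locations. The names &lt;code&gt;qap_cost&lt;/code&gt; and &lt;code&gt;brute_force_qap&lt;/code&gt; and the 3x3 matrices are hypothetical, chosen only for illustration.&lt;/p&gt;

```python
from itertools import permutations


def qap_cost(flow, dist, perm):
    """Total cost when facility i is placed at location perm[i]."""
    n = len(perm)
    return sum(
        flow[i][j] * dist[perm[i]][perm[j]]
        for i in range(n)
        for j in range(n)
    )


def brute_force_qap(flow, dist):
    """Enumerate all n! assignments; only feasible for very small n."""
    best = min(
        permutations(range(len(flow))),
        key=lambda p: qap_cost(flow, dist, p),
    )
    return list(best), qap_cost(flow, dist, best)


# Hypothetical instance: three facilities, three locations on a line.
flow = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
assignment, cost = brute_force_qap(flow, dist)
```

&lt;p&gt;Exhaustive search is shown only to make the objective concrete; the factorial number of assignments is what motivates heuristic and learning-based solvers.&lt;/p&gt;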
&lt;h2 class="heading" id="travelling-salesman-problem"&gt;
 Travelling Salesman Problem&lt;span class="heading__anchor"&gt; &lt;a href="#travelling-salesman-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;One of the most studied combinatorial optimization problems, seeking the shortest route visiting all cities exactly once. Neural approaches and learning-based heuristics have shown competitive performance compared to traditional methods.&lt;/p&gt;
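&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{i=1}^{n} d(c_{\pi(i)}, c_{\pi((i \bmod n) + 1)})$ where $\pi$ is a permutation of the $n$ cities and $d$ is the distance function. The index $(i \bmod n) + 1$ wraps the last position back to the first, closing the tour, and the permutation guarantees that each city is visited exactly once.&lt;/p&gt;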
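&lt;p&gt;The tour-length objective can be sketched in a few lines. This example assumes Euclidean distances and 0-based indices, so the wrap-around from the last city to the first is written as &lt;code&gt;(i + 1) % n&lt;/code&gt;. The function names and the unit-square instance are hypothetical.&lt;/p&gt;

```python
import math
from itertools import permutations


def tour_length(cities, tour):
    """Length of the closed tour that visits cities in the given order."""
    n = len(tour)
    # (i + 1) % n sends the last position back to the first city.
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
        for i in range(n)
    )


def brute_force_tsp(cities):
    """Exhaustive search; only sensible for a handful of cities."""
    n = len(cities)
    # Fix city 0 as the start so rotations of the same tour are skipped.
    candidates = ((0,) + rest for rest in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(cities, t))
    return list(best), tour_length(cities, best)


cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # corners of a unit square
tour, length = brute_force_tsp(cities)
```

&lt;p&gt;On the unit square the shortest tour walks the perimeter, while a tour that crosses the diagonals is strictly longer.&lt;/p&gt;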
&lt;h2 class="heading" id="portfolio-optimization"&gt;
 Portfolio Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#portfolio-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Financial optimization for asset allocation, determining optimal portfolio composition to maximize returns while managing risk and satisfying investment constraints.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\mathbf{w}^T \boldsymbol{\mu} - \lambda \mathbf{w}^T \Sigma \mathbf{w}$ subject to $\sum w_i = 1$ and $w_i \geq 0$, where $\mathbf{w}$ are weights, $\boldsymbol{\mu}$ expected returns, $\Sigma$ covariance matrix, and $\lambda$ risk aversion.&lt;/p&gt;
&lt;h2 class="heading" id="maximal-cut"&gt;
 Maximal Cut&lt;span class="heading__anchor"&gt; &lt;a href="#maximal-cut"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;The problem of partitioning graph vertices into two sets to maximize edges between partitions. A fundamental graph problem with applications in circuit design and network optimization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Partition vertices $V$ into disjoint sets $S$ and $\bar{S}$ to maximize $|\{(u,v) \in E : u \in S, v \in \bar{S}\}|$, or equivalently maximize $\sum_{(u,v) \in E} (x_u + x_v - 2 x_u x_v)$ where $x_i \in \{0,1\}$ indicates whether vertex $i$ is in $S$.&lt;/p&gt;
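&lt;p&gt;As an illustration, the cut value of a 0/1 labelling can be computed edge by edge. The sketch below uses the symmetric expression $x_u + x_v - 2 x_u x_v$, which equals 1 exactly when the two endpoints lie on opposite sides of the partition. The function names and the 4-cycle instance are hypothetical.&lt;/p&gt;

```python
from itertools import product


def cut_value(edges, x):
    """Number of edges whose endpoints fall on opposite sides of the cut."""
    # x[u] + x[v] - 2 * x[u] * x[v] is 1 when x[u] != x[v], else 0.
    return sum(x[u] + x[v] - 2 * x[u] * x[v] for u, v in edges)


def brute_force_max_cut(n, edges):
    """Try all 2**n labellings; exponential, for tiny graphs only."""
    best = max(product((0, 1), repeat=n), key=lambda x: cut_value(edges, x))
    return list(best), cut_value(edges, best)


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle
labels, value = brute_force_max_cut(4, edges)
```

&lt;p&gt;Because the 4-cycle is bipartite, alternating labels cut every edge, so the maximum cut equals the number of edges.&lt;/p&gt;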
&lt;h2 class="heading" id="vehicle-routing-problem"&gt;
 Vehicle Routing Problem&lt;span class="heading__anchor"&gt; &lt;a href="#vehicle-routing-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Optimizing routes for a fleet of vehicles to serve customers with minimum distance/cost. Extensions include time windows, capacity constraints, and multiple depots, common in logistics and delivery services.&lt;/p&gt;
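&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{k=1}^{K} \sum_{i,j} c_{ij} x_{ijk}$ subject to each customer being visited by exactly one vehicle, vehicle capacity constraints $\sum_{i \in R_k} d_i \leq C_k$, and flow conservation constraints, where $x_{ijk}$ indicates if vehicle $k$ travels from $i$ to $j$, $c_{ij}$ is the travel cost from $i$ to $j$, $R_k$ is the set of customers served by vehicle $k$, $d_i$ is the demand of customer $i$, and $C_k$ is the capacity of vehicle $k$.&lt;/p&gt;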
&lt;h2 class="heading" id="job-shop-scheduling-problem"&gt;
 Job Shop Scheduling Problem&lt;span class="heading__anchor"&gt; &lt;a href="#job-shop-scheduling-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Scheduling jobs on machines to minimize completion time while respecting precedence and machine constraints. A fundamental problem in manufacturing and production planning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize makespan $C_{max}$ subject to: each job $j$ consists of operations that must be processed in order on specified machines, each machine can process at most one operation at a time, and operation durations are fixed.&lt;/p&gt;
&lt;h2 class="heading" id="maximum-independent-set"&gt;
 Maximum Independent Set&lt;span class="heading__anchor"&gt; &lt;a href="#maximum-independent-set"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding the largest set of vertices with no edges between them in a graph. An NP-hard problem with applications in scheduling, coding theory, and network design.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \leq 1$ for all $(i,j) \in E$ and $x_i \in \{0,1\}$, where $x_i = 1$ if vertex $i$ is in the set.&lt;/p&gt;
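&lt;p&gt;The edge constraint says that no edge may have both endpoints selected. A small sketch of that feasibility check, together with an exhaustive search over subsets, is given below. The function names and the 5-vertex path are hypothetical, and the search is exponential in the number of vertices.&lt;/p&gt;

```python
from itertools import combinations


def is_independent(edges, subset):
    """True when no edge has both endpoints inside subset."""
    chosen = set(subset)
    return not any(u in chosen and v in chosen for u, v in edges)


def brute_force_mis(n, edges):
    """Check subsets from largest to smallest and return the first feasible one."""
    for size in range(n, -1, -1):
        for subset in combinations(range(n), size):
            if is_independent(edges, subset):
                return list(subset)
    return []


edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # a path on 5 vertices
best = brute_force_mis(5, edges)
```

&lt;p&gt;On a path with five vertices, taking every other vertex gives an independent set of size three.&lt;/p&gt;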
&lt;h2 class="heading" id="generalization"&gt;
 Generalization&lt;span class="heading__anchor"&gt; &lt;a href="#generalization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Studying how machine learning solvers generalize across different problem instances and scales, and developing methods that handle adversarial or out-of-distribution scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Train model $\theta$ on distribution $D_{train}$ minimizing $\mathbb{E}_{\mathbf{x} \sim D_{train}}[\ell(f_\theta(\mathbf{x}), y^*)]$ such that test error $\mathbb{E}_{\mathbf{x} \sim D_{test}}[\ell(f_\theta(\mathbf{x}), y^*)]$ remains small for $D_{test}$ different from $D_{train}$ (different sizes, perturbations).&lt;/p&gt;
&lt;h2 class="heading" id="orienteering-problem"&gt;
 Orienteering Problem&lt;span class="heading__anchor"&gt; &lt;a href="#orienteering-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;A variant of the traveling salesman problem where a subset of vertices must be selected to maximize profit while respecting a distance constraint. Applications include tourist route planning and project selection.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\sum_{i \in S} p_i$ subject to the total travel distance $\sum_{(i,j) \in R} d_{ij} \leq L$, where $S \subseteq V$ is the set of selected vertices, $R$ is the set of consecutive vertex pairs along the route through $S$, $p_i$ are profits, and $L$ is the distance limit.&lt;/p&gt;
&lt;h2 class="heading" id="knapsack"&gt;
 Knapsack&lt;span class="heading__anchor"&gt; &lt;a href="#knapsack"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;The problem of selecting items with given weights and values to maximize total value within a weight capacity. A fundamental dynamic programming problem with numerous variants (0/1, bounded, unbounded).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\sum_{i=1}^{n} v_i x_i$ subject to $\sum_{i=1}^{n} w_i x_i \leq W$ and $x_i \in \{0,1\}$, where $v_i$ are values, $w_i$ are weights, and $W$ is capacity.&lt;/p&gt;
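&lt;p&gt;The 0/1 variant admits the classic dynamic program over capacities. The following is a minimal sketch, assuming positive integer weights and an integer capacity; the function name &lt;code&gt;knapsack_01&lt;/code&gt; and the sample data are illustrative, not taken from the text.&lt;/p&gt;

```python
def knapsack_01(values, weights, capacity):
    """Best total value for the 0/1 knapsack (integer weights assumed)."""
    # best[c] holds the best value achievable with total weight at most c.
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Scan capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


print(knapsack_01([60, 100, 120], [10, 20, 30], 50))
```

&lt;p&gt;The table has $W + 1$ entries and each item touches every entry once, so the running time is $O(nW)$, which is pseudo-polynomial in the capacity.&lt;/p&gt;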
&lt;h2 class="heading" id="computing-resource-allocation"&gt;
 Computing Resource Allocation&lt;span class="heading__anchor"&gt; &lt;a href="#computing-resource-allocation"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Optimal allocation of computational resources (CPU, memory, bandwidth) across tasks or virtual machines to maximize utilization while meeting performance requirements.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\sum_{t=1}^{T} u_t$ subject to $\sum_{t \in task_i} r_{t,d} \leq R_{i,d}$ for each resource $d$ and latency constraints $L_t \leq L_{max,t}$, where $u_t$ is the utility of task $t$, $r_{t,d}$ is the amount of resource $d$ used by task $t$, and $R_{i,d}$ is the corresponding capacity.&lt;/p&gt;
&lt;h2 class="heading" id="bin-packing-problem"&gt;
 Bin Packing Problem&lt;span class="heading__anchor"&gt; &lt;a href="#bin-packing-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Packing items of varying sizes into a minimum number of bins, a classic problem in logistics and resource management. Variants include 2D and 3D packing with practical applications in shipping and manufacturing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{b=1}^{B} y_b$ subject to $\sum_{i \in b} s_i \leq C \cdot y_b$ for each bin $b$, where $s_i$ is item size, $C$ is bin capacity, and $y_b \in \{0,1\}$ indicates whether bin $b$ is used.&lt;/p&gt;
&lt;h2 class="heading" id="graph-edit-distance"&gt;
 Graph Edit Distance&lt;span class="heading__anchor"&gt; &lt;a href="#graph-edit-distance"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Measuring the dissimilarity between two graphs as the minimum cost of edit operations (insertions, deletions, substitutions) needed to transform one into another. Used in pattern recognition and molecule comparison.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: $GED(G_1, G_2) = \min_{\xi} \sum_{op \in \xi} cost(op)$ where $\xi$ is an edit path transforming $G_1$ to $G_2$, and cost is the sum of operation costs (vertex/edge insertion, deletion, substitution).&lt;/p&gt;
&lt;h2 class="heading" id="hamiltonian-cycle-problem"&gt;
 Hamiltonian Cycle Problem&lt;span class="heading__anchor"&gt; &lt;a href="#hamiltonian-cycle-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding a cycle that visits every vertex exactly once in an undirected graph. A fundamental NP-complete problem related to the traveling salesman problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Determine if there exists a cycle in graph $G = (V, E)$ that visits every vertex in $V$ exactly once. Decision problem: is a Hamiltonian cycle present?&lt;/p&gt;
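&lt;p&gt;The decision problem can be illustrated with a brute-force check over vertex orderings. This is a sketch under stated assumptions: an undirected simple graph with at least three vertices numbered from 0, and a hypothetical helper name &lt;code&gt;has_hamiltonian_cycle&lt;/code&gt;. It takes factorial time, so it only suits toy instances.&lt;/p&gt;

```python
from itertools import permutations


def has_hamiltonian_cycle(n, edges):
    """Brute-force test for a Hamiltonian cycle (n of at least 3 assumed)."""
    adjacent = {frozenset(edge) for edge in edges}
    # Fix vertex 0 as the start so rotations of a cycle are not re-checked.
    for order in permutations(range(1, n)):
        cycle = (0,) + order + (0,)
        if all(frozenset(pair) in adjacent for pair in zip(cycle, cycle[1:])):
            return True
    return False


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
star = [(0, 1), (0, 2), (0, 3)]
print(has_hamiltonian_cycle(4, square), has_hamiltonian_cycle(4, star))
```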
&lt;h2 class="heading" id="graph-coloring"&gt;
 Graph Coloring&lt;span class="heading__anchor"&gt; &lt;a href="#graph-coloring"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Assigning colors to vertices such that no adjacent vertices share the same color, using the minimum number of colors. Applications include scheduling, register allocation, and map coloring.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $k$ such that $c: V \to \{1, \ldots, k\}$ where $c(u) \neq c(v)$ for all $(u,v) \in E$, i.e., find the chromatic number $\chi(G)$.&lt;/p&gt;
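&lt;p&gt;Computing $\chi(G)$ exactly is hard, but a proper coloring is easy to construct greedily. The sketch below is a heuristic that only yields an upper bound on $\chi(G)$; the adjacency-dictionary input and the name &lt;code&gt;greedy_coloring&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

```python
from itertools import count


def greedy_coloring(adjacency):
    """Give each vertex, in dictionary order, the smallest unused color."""
    color = {}
    for vertex in adjacency:
        used = {color[u] for u in adjacency[vertex] if u in color}
        color[vertex] = next(c for c in count(1) if c not in used)
    return color


# Triangle 0-1-2 with an extra vertex 3 attached to vertex 2.
triangle_plus_tail = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_coloring(triangle_plus_tail)
print(coloring, max(coloring.values()))
```

&lt;p&gt;The number of colors used depends on the vertex order, so the result can exceed the chromatic number on other graphs.&lt;/p&gt;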
&lt;h2 class="heading" id="maximal-common-subgraph"&gt;
 Maximal Common Subgraph&lt;span class="heading__anchor"&gt; &lt;a href="#maximal-common-subgraph"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding the largest subgraph isomorphic to both input graphs, useful in molecular structure comparison and pattern discovery applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Find subgraph $G_{mcs} = (V_{mcs}, E_{mcs})$ that is isomorphic to subgraphs of both $G_1$ and $G_2$, maximizing $|V_{mcs}|$ (or $|E_{mcs}|$).&lt;/p&gt;
&lt;h2 class="heading" id="influence-maximization"&gt;
 Influence Maximization&lt;span class="heading__anchor"&gt; &lt;a href="#influence-maximization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Selecting a subset of nodes in a social network to maximize the spread of information or influence through the network. A key problem in viral marketing and network analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Select subset $S \subseteq V$ with $|S| \leq k$ to maximize the expected spread $f(S) = \mathbb{E}[|T(S)|]$, the expected number of influenced nodes given initial set $S$, where $T(S)$ denotes the set of nodes influenced when seeding from $S$.&lt;/p&gt;
&lt;h2 class="heading" id="boolean-satisfiability"&gt;
 Boolean Satisfiability&lt;span class="heading__anchor"&gt; &lt;a href="#boolean-satisfiability"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Determining if a boolean formula can be satisfied, one of the most studied NP-complete problems. Recent neural approaches have shown promise for both solving and reasoning about SAT instances.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Given boolean formula $\phi$ in conjunctive normal form (CNF) with $m$ clauses over $n$ variables, determine if there exists an assignment $\mathbf{x} \in \{0,1\}^n$ such that $\phi(\mathbf{x}) = \text{true}$.&lt;/p&gt;
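&lt;p&gt;The definition can be made concrete with an exhaustive search over all $2^n$ assignments. This is a sketch, not a practical SAT solver. The clause encoding, a list of &lt;code&gt;(variable index, required value)&lt;/code&gt; pairs, and the name &lt;code&gt;satisfying_assignment&lt;/code&gt; are assumptions chosen for readability.&lt;/p&gt;

```python
from itertools import product


def satisfying_assignment(n, clauses):
    """Return an assignment satisfying every clause, or None if none exists.

    Each clause is a list of (variable_index, required_value) literals.
    """
    for x in product((False, True), repeat=n):
        # A CNF formula holds when every clause has at least one true literal.
        if all(any(x[v] == wanted for v, wanted in clause) for clause in clauses):
            return x
    return None


# (x0 or x1) and (not x0 or x1) and (x0 or not x1)
cnf = [[(0, True), (1, True)], [(0, False), (1, True)], [(0, True), (1, False)]]
print(satisfying_assignment(2, cnf))
# Adding (not x0 or not x1) rules out the only remaining assignment.
print(satisfying_assignment(2, cnf + [[(0, False), (1, False)]]))
```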
&lt;h2 class="heading" id="max-clique"&gt;
 Max Clique&lt;span class="heading__anchor"&gt; &lt;a href="#max-clique"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding the largest clique (complete subgraph) in an undirected graph. An NP-hard problem with applications in social network analysis and bioinformatics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Maximize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \leq 1 + \mathbb{1}_{(i,j) \in E}$ for all $i &amp;lt; j$ and $x_i \in \{0,1\}$, finding the largest complete subgraph.&lt;/p&gt;
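&lt;p&gt;The pairwise constraint says that two vertices may both be selected only if they are adjacent. A small enumeration makes this explicit. The sketch assumes 0-based vertices and an undirected edge list, and &lt;code&gt;max_clique&lt;/code&gt; is a hypothetical name; it is exponential and intended for toy graphs only.&lt;/p&gt;

```python
from itertools import combinations


def max_clique(n, edges):
    """Return one largest clique as a tuple of vertices (brute force)."""
    adjacent = {frozenset(edge) for edge in edges}
    # Try subset sizes from largest to smallest and stop at the first clique.
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
                return subset
    return ()


# Triangle 0-1-2 plus a pendant edge 2-3.
graph_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(max_clique(4, graph_edges))
```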
&lt;h2 class="heading" id="mixed-integer-programming"&gt;
 Mixed Integer Programming&lt;span class="heading__anchor"&gt; &lt;a href="#mixed-integer-programming"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Optimizing linear objective functions subject to linear constraints where some variables must be integers. A general framework encompassing many CO problems, widely used in operations research.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\mathbf{c}^T \mathbf{x}$ subject to $A\mathbf{x} \leq \mathbf{b}$, $\mathbf{x} \geq \mathbf{0}$, and $x_i \in \mathbb{Z}$ for $i \in I$, where $I$ indicates integer-constrained variables.&lt;/p&gt;
&lt;h2 class="heading" id="causal-discovery"&gt;
 Causal Discovery&lt;span class="heading__anchor"&gt; &lt;a href="#causal-discovery"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Learning the underlying causal structure from observational data, identifying causal relationships between variables. Important for understanding complex systems in medicine, economics, and science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Learn directed acyclic graph (DAG) $G = (V, E)$ from observational data where edge $(i \to j) \in E$ indicates $i$ causally influences $j$. Goal: identify true DAG $G^*$ minimizing score $S(G | \mathbf{D})$ subject to acyclicity constraint.&lt;/p&gt;
&lt;h2 class="heading" id="game-theoretic-semantics"&gt;
 Game Theoretic Semantics&lt;span class="heading__anchor"&gt; &lt;a href="#game-theoretic-semantics"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;A game-based interpretation of logical formulas where truth is determined by winning strategies in semantic games, providing computational game-theoretic perspectives on logic and reasoning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: For formula $\phi$ in language $L$, define semantic game where two players (verifier and falsifier) move according to formula structure. Formula is true in structure if verifier has winning strategy.&lt;/p&gt;
&lt;h2 class="heading" id="differentiable-optimization"&gt;
 Differentiable Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#differentiable-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Making optimization layers differentiable so they can be embedded in neural networks, enabling end-to-end learning where optimization problems become trainable components of deep models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Given parametric optimization problem $y^* = \arg\min_y f(y; \theta)$, compute implicit gradient $\frac{\partial y^*}{\partial \theta}$ using implicit differentiation: $\nabla_\theta y^* = -[\nabla_y^2 f]^{-1} \nabla_{\theta,y}^2 f$ enabling backpropagation through optimizer.&lt;/p&gt;
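&lt;p&gt;A scalar toy problem shows the implicit-gradient formula at work. The objective $f(y; \theta) = e^y - \theta y$ is an assumption chosen because its minimizer $y^* = \ln \theta$ is known in closed form; the helper names and the fixed Newton step count are likewise illustrative. The sketch compares the implicit gradient with a central finite difference.&lt;/p&gt;

```python
import math


def solve_inner(theta, steps=50):
    """Newton iterations for argmin_y exp(y) - theta * y, with theta positive."""
    y = 0.0
    for _ in range(steps):
        gradient = math.exp(y) - theta  # first derivative of f in y
        hessian = math.exp(y)           # second derivative of f in y
        y -= gradient / hessian
    return y


def implicit_gradient(theta):
    """Apply dy*/dtheta = -[d2f/dy2]^(-1) * d2f/(dtheta dy) at the optimum."""
    y_star = solve_inner(theta)
    hessian = math.exp(y_star)
    mixed = -1.0  # derivative of (exp(y) - theta) with respect to theta
    return -mixed / hessian


theta = 2.0
h = 1e-5
finite_difference = (solve_inner(theta + h) - solve_inner(theta - h)) / (2 * h)
print(implicit_gradient(theta), finite_difference)
```

&lt;p&gt;Both numbers should agree with the analytic value $1/\theta$. In a real optimization layer the Hessian is a matrix, and the inverse is replaced by a linear solve.&lt;/p&gt;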
&lt;h2 class="heading" id="car-dispatch"&gt;
 Car Dispatch&lt;span class="heading__anchor"&gt; &lt;a href="#car-dispatch"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Optimally assigning vehicles to passenger requests in ride-hailing and autonomous driving systems, minimizing empty miles and response times.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Assign requests $R$ to vehicles $V$ minimizing $\sum_{r \in R} (\alpha \cdot ETA_r + \beta \cdot det_{r})$ subject to vehicle capacity $|A_v| \leq C_v$, time window constraints on pickups/dropoffs, and driver constraints.&lt;/p&gt;
&lt;h2 class="heading" id="conjunctive-query-containment"&gt;
 Conjunctive Query Containment&lt;span class="heading__anchor"&gt; &lt;a href="#conjunctive-query-containment"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;A fundamental problem in database theory and reasoning, determining whether one query result is guaranteed to be a subset of another query&amp;rsquo;s result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Given conjunctive queries $Q_1, Q_2$ over schema, determine if $\text{ans}(Q_1, I) \subseteq \text{ans}(Q_2, I)$ for all possible database instances $I$. Equivalently, check if there exists homomorphism from $Q_2$ to $Q_1$.&lt;/p&gt;
&lt;h2 class="heading" id="virtual-network-embedding"&gt;
 Virtual Network Embedding&lt;span class="heading__anchor"&gt; &lt;a href="#virtual-network-embedding"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Mapping virtual network components (nodes and links) onto physical infrastructure, optimizing resource utilization and quality of service in cloud computing and network management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Map virtual network $G_v = (V_v, E_v)$ to substrate network $G_s = (V_s, E_s)$ by finding embedding $e_n: V_v \to V_s$ and $e_l: E_v \to P(E_s)$ minimizing resource usage while ensuring capacity constraints.&lt;/p&gt;
&lt;h2 class="heading" id="predictoptimize"&gt;
 Predict+Optimize&lt;span class="heading__anchor"&gt; &lt;a href="#predictoptimize"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Decision-focused learning that integrates prediction and optimization into a unified framework, optimizing predictions for decision quality rather than traditional accuracy metrics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Train predictor $f_\theta$ to minimize task loss $\mathcal{L}(y^*(f_\theta(\mathbf{x})), y^*_{opt}) = \mathcal{L}(\arg\min_y f(y; f_\theta(\mathbf{x})), y^*_{opt})$, where $y^*_{opt}$ is the optimal decision under the true parameters, using implicit differentiation through the optimization layer.&lt;/p&gt;
&lt;h2 class="heading" id="optimal-power-flow"&gt;
 Optimal Power Flow&lt;span class="heading__anchor"&gt; &lt;a href="#optimal-power-flow"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Determining optimal setpoints for generators in power systems to supply electricity while minimizing costs and satisfying physical constraints, fundamental for smart grid management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{g=1}^{G} (a_g + b_g P_g + c_g P_g^2)$ subject to power balance $P_i = \sum_g P_g - L_i$, voltage constraints $|V_i| \in [V_{min}, V_{max}]$, and transmission limits.&lt;/p&gt;
&lt;h2 class="heading" id="facility-location-problem"&gt;
 Facility Location Problem&lt;span class="heading__anchor"&gt; &lt;a href="#facility-location-problem"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Determining optimal locations for facilities (warehouses, hospitals, schools) to serve customers, minimizing total distance and facility opening costs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{j=1}^{m} f_j y_j + \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij}$ subject to $\sum_j x_{ij} = 1$ (serve all customers), $x_{ij} \leq y_j$ (assignment constraints), and $y_j \in \{0,1\}$.&lt;/p&gt;
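&lt;p&gt;Because the formulation has no capacity limits, fixing the open set $y$ makes the assignment trivial: each customer goes to its cheapest open facility. The sketch below exploits this by enumerating open sets. The function name &lt;code&gt;solve_facility_location&lt;/code&gt; and the sample costs are assumptions, and the enumeration is exponential in $m$.&lt;/p&gt;

```python
from itertools import combinations


def solve_facility_location(open_cost, assign_cost):
    """Return (total cost, open facilities) by enumerating facility subsets.

    assign_cost[i][j] is the cost of serving customer i from facility j.
    """
    facilities = range(len(open_cost))
    candidates = []
    for size in range(1, len(open_cost) + 1):
        for opened in combinations(facilities, size):
            fixed = sum(open_cost[j] for j in opened)
            # With the open set fixed, each customer uses its cheapest open facility.
            service = sum(min(row[j] for j in opened) for row in assign_cost)
            candidates.append((fixed + service, opened))
    # Ties on cost are broken by the lexicographically smallest open set.
    return min(candidates)


print(solve_facility_location([3, 4], [[1, 6], [6, 1]]))
```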
&lt;h2 class="heading" id="sorting--ranking"&gt;
 Sorting &amp;amp; Ranking&lt;span class="heading__anchor"&gt; &lt;a href="#sorting--ranking"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Differentiable sorting and ranking operations that can be integrated into neural networks, enabling permutation-based learning and differentiable ranking optimization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Approximate permutation matrix $P \in \mathbb{R}^{n \times n}$ where $P\mathbf{x}$ sorts vector $\mathbf{x}$ in differentiable manner, or compute ranking scores $r_i$ for items proportional to quality or preference.&lt;/p&gt;
&lt;h2 class="heading" id="combinatorial-drug-recommendation"&gt;
 Combinatorial Drug Recommendation&lt;span class="heading__anchor"&gt; &lt;a href="#combinatorial-drug-recommendation"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding optimal combinations of drugs to maximize therapeutic efficacy while minimizing adverse interactions, a key application in personalized medicine and drug discovery.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Select drug subset $S \subseteq D$ to maximize efficacy $f(S)$ subject to safety constraint (drug interactions) $g(S) \leq \epsilon$ and cardinality limit $|S| \leq k$.&lt;/p&gt;
&lt;h2 class="heading" id="stochastic-combinatorial-optimization"&gt;
 Stochastic Combinatorial Optimization&lt;span class="heading__anchor"&gt; &lt;a href="#stochastic-combinatorial-optimization"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Addressing CO problems with random or uncertain parameters, developing robust or adaptive solutions that perform well under uncertainty and variability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\mathbb{E}[f(\mathbf{x}, \boldsymbol{\xi})]$ over decision $\mathbf{x} \in X$ where $\boldsymbol{\xi}$ is a random parameter vector, or find the robust solution $\mathbf{x}^* = \arg\min_{\mathbf{x} \in X} \max_{\boldsymbol{\xi} \in U} f(\mathbf{x}, \boldsymbol{\xi})$ over an uncertainty set $U$.&lt;/p&gt;
&lt;h2 class="heading" id="vertex-cover"&gt;
 Vertex Cover&lt;span class="heading__anchor"&gt; &lt;a href="#vertex-cover"&gt;#&lt;/a&gt;&lt;/span&gt;
&lt;/h2&gt;&lt;p&gt;Finding the minimum set of vertices that covers all edges in a graph. A fundamental NP-hard problem with applications in network design and bioinformatics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Formulation&lt;/strong&gt;: Minimize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \geq 1$ for all $(i,j) \in E$ and $x_i \in \{0,1\}$, where $x_i = 1$ if vertex $i$ is in the cover.&lt;/p&gt;</description></item></channel></rss>