Machine Learning on Nam Le

Mathematics - Optimization

Thu, 27 Jun 2024 23:14:15 +0800

Branches of Optimization Research #

Convex Optimization #

Convex optimization focuses on problems where the objective function and constraints are convex, so every local optimum is also a global optimum. This field is foundational in machine learning, signal processing, and control systems because many convex problems admit efficient algorithms with convergence guarantees.

Convex Optimization by Boyd and Vandenberghe - PDF
Convex Optimization Theory by Dimitri P. Bertsekas - PDF

Discrete, Combinatorial, and Integer Optimization #

This branch deals with optimization problems involving discrete variables, such as integers or combinatorial structures, often encountered in scheduling, network design, and logistics. The titles below also cover Bayesian optimization, which is not limited to discrete variables and is particularly useful for optimizing expensive black-box functions.

Bayesian Optimization In Action by Quan Nguyen - Amazon
Experimentation for Engineers by David Sweet - Amazon

Operations Research #

Operations research applies mathematical modeling and optimization to complex decision-making in logistics, supply chain, and resource allocation. It integrates techniques like linear programming, simulation, and heuristic methods to optimize real-world systems.

Operations Research An Introduction by Hamdy A. Taha - Pearson
Introduction to Operations Research by Frederick Hillier and Gerald Lieberman - McGraw Hill
Julia Programming for Operations Research by Changhyun Kwon - PDF - code
Mathematical Programming and Operations Research: Modeling, Algorithms, and Complexity. Examples in Python and Julia. Edited by Robert Hildebrand - PDF
A First Course in Linear Optimization by Jon Lee - PDF
Decomposition Techniques in Mathematical Programming by Conejo, Castillo, Mínguez, and García-Bertrand - Springer
Algorithms for Optimization by Mykel J. Kochenderfer and Tim A. Wheeler - PDF
Model Building in Mathematical Programming - Introductory modeling book by H. Paul Williams - Wiley

Meta-heuristics #

Meta-heuristics are high-level strategies for solving complex optimization problems where exact methods are computationally infeasible. They include nature-inspired algorithms like genetic algorithms and simulated annealing, widely used in engineering and data science.

Metaheuristics by Patrick Siarry - Springer (open access)
Essentials of Metaheuristics by Sean Luke - link
Handbook of Metaheuristics by Michel Gendreau and Jean-Yves Potvin - Springer (open access)
An Introduction to Metaheuristics for Optimization by Bastien Chopard and Marco Tomassini - Springer (open access)
Metaheuristic and Evolutionary Computation: Algorithms and Applications by Hasmat Malik, Atif Iqbal, Puneet Joshi, Sanjay Agrawal, and Farhad Ilahi Bakhsh - Springer (open access)
Clever Algorithms: Nature-Inspired Programming Recipes by Jason Brownlee - GitHub
Metaheuristics: from design to implementation by El-Ghazali Talbi - Wiley

Dynamic Programming and Reinforcement Learning #

Dynamic programming and reinforcement learning address sequential decision-making problems, breaking them into subproblems or learning optimal policies through interaction with environments. These methods are critical in robotics, finance, and AI.

Various titles on Dynamic Programming, Optimal Control and Reinforcement Learning by Dimitri Bertsekas - List
Reinforcement Learning: An Introduction (2nd Edition) by Richard Sutton and Andrew Barto - PDF
Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer - PDF
Algorithms for Decision Making by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray - PDF

Constraint Programming #

Constraint programming solves problems by defining constraints that must be satisfied, often used in scheduling, planning, and configuration tasks. It excels in problems with complex logical constraints and discrete variables.

Handbook of Constraint Programming by Francesca Rossi, Peter van Beek and Toby Walsh - Amazon
A Tutorial on Constraint Programming by Barbara M. Smith (University of Leeds) - PDF

Combinatorial Optimization #

Combinatorial optimization focuses on finding optimal solutions in discrete structures, such as graphs or sets, often using algorithms for problems like the traveling salesman or graph coloring, with applications in logistics and network design.

Combinatorial Optimization: Algorithms and Complexity by Christos H. Papadimitriou and Kenneth Steiglitz - Amazon
Combinatorial Optimization: Theory and Algorithms by Bernhard Korte and Jens Vygen - Springer
A First Course in Combinatorial Optimization by Jon Lee - Amazon

Stochastic Optimization and Control #

Stochastic optimization handles problems with uncertainty or randomness, using probabilistic models to optimize objectives. It is widely applied in machine learning, finance, and operations research for robust decision-making.

Lectures on Stochastic Programming Modeling and Theory (SIAM) - by Shapiro, Dentcheva, and Ruszczynski - PDF
Introductory Lectures on Stochastic Optimization by John C. Duchi - PDF

Useful Resources #

Prof. Nguyen Mau Nam, Convex Analysis - An introduction to convexity and nonsmooth analysis
Ben Recht, arg min
Prof. Dimitri P. Bertsekas, Convex Analysis and Optimization
Prof. Dimitri P. Bertsekas, Nonlinear Programming: 3rd Edition
Off the convex path

Post on Optimization #

Second-order Stochastic Optimization methods for Machine Learning

Thu, 27 Jun 2024 23:14:15 +0800

Analysis of the Hessian #

1. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks #

Year: 2017
Authors: Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, Leon Bottou
ArXiv ID: arXiv:1706.04454
URL: https://arxiv.org/abs/1706.04454

Abstract: We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.

Source Code: No explicit source code information found

2. The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size #

Year: 2018
Authors: Vardan Papyan
ArXiv ID: arXiv:1811.07062
URL: https://arxiv.org/abs/1811.07062

Abstract: We apply state-of-the-art tools in modern high-dimensional numerical linear algebra to approximate efficiently the spectrum of the Hessian of modern deepnets, with tens of millions of parameters, trained on real data. Our results corroborate previous findings, based on small-scale networks, that the Hessian exhibits “spiked” behavior, with several outliers isolated from a continuous bulk. We decompose the Hessian into different components and study the dynamics with training and sample size of each term individually.

Source Code: No explicit source code information found

3. PyHessian: Neural Networks Through the Lens of the Hessian #

Year: 2019
Authors: Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney
ArXiv ID: arXiv:1912.07145
URL: https://arxiv.org/abs/1912.07145

Abstract: We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.

Source Code: Mentions ‘available’ in abstract; Mentions ‘open source’ in abstract; Known repository: https://github.com/amirgholami/PyHessian
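
The top Hessian eigenvalue mentioned in this abstract can be estimated without ever forming the Hessian, using only Hessian-vector products. The following is a minimal sketch of power iteration under stated assumptions: it does not use PyHessian or its API, the toy quadratic loss and the function names (`grad`, `hvp`, `top_eigenvalue`) are hypothetical, and the Hessian-vector product is a central finite difference of gradients rather than automatic differentiation.

```python
import math
import random

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * w^T A w with A = [[3, 1], [1, 2]] (hypothetical example)."""
    return [3.0 * w[0] + 1.0 * w[1], 1.0 * w[0] + 2.0 * w[1]]

def hvp(grad_fn, w, v, eps=1e-4):
    """Central-difference Hessian-vector product: (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]

def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    """Power iteration on the Hessian at w, touching it only through hvp."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        # Rayleigh quotient v^T H v for the current unit vector v.
        eig = sum(a * b for a, b in zip(v, hv))
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return eig

lam = top_eigenvalue(grad, [0.5, -1.0])
```

For this toy matrix the largest eigenvalue is (5 + sqrt(5)) / 2, which gives a quick sanity check on `lam`.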

4. A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization #

Year: 2020
Authors: Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, Vineeth N Balasubramanian
ArXiv ID: arXiv:2012.03801
URL: https://arxiv.org/abs/2012.03801

Abstract: Loss landscape analysis is extremely useful for a deeper understanding of the generalization ability of deep neural network models. In this work, we propose a layerwise loss landscape analysis where the loss surface at every layer is studied independently and also on how each correlates to the overall loss surface. We study the layerwise loss landscape by studying the eigenspectra of the Hessian at each layer. In particular, our results show that the layerwise Hessian geometry is largely similar to the entire Hessian. We also report an interesting phenomenon where the Hessian eigenspectrum of middle layers of the deep neural network are observed to most similar to the overall Hessian eigenspectrum. We also show that the maximum eigenvalue and the trace of the Hessian (both full network and layerwise) reduce as training of the network progresses. We leverage on these observations to propose a new regularizer based on the trace of the layerwise Hessian. Penalizing the trace of the Hessian at every layer indirectly forces Stochastic Gradient Descent to converge to flatter minima, which are shown to have better generalization performance. In particular, we show that such a layerwise regularizer can be leveraged to penalize the middlemost layers alone, which yields promising results. Our empirical studies on well-known deep nets across datasets support the claims of this work

Source Code: No explicit source code information found

Diagonal Scaling #

1. AdaHessian: An Adaptive Second Order Optimizer for Machine Learning #

Year: 2020
Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
ArXiv ID: arXiv:2006.00719
Algorithm: AdaHessian
URL: https://arxiv.org/abs/2006.00719

Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.

Source Code: Known repository: https://github.com/amirgholami/adahessian
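
Item (i) of the abstract relies on a Hutchinson-style estimate of the Hessian diagonal, which rests on the identity diag(H) = E[z ⊙ (Hz)] for random sign vectors z. The sketch below illustrates only that estimator, assuming a small explicit matrix `H` as a stand-in for a real Hessian-vector product; the function names are hypothetical, and the moving average and block averaging from items (ii) and (iii) are not reproduced.

```python
import random

def matvec(H, v):
    """Dense matrix-vector product, standing in for a Hessian-vector product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def hutchinson_diag(hvp_fn, dim, num_samples=4000, seed=0):
    """Monte Carlo estimate of diag(H) as the average of z * (H z) over Rademacher z."""
    rng = random.Random(seed)
    est = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp_fn(z)
        for i in range(dim):
            est[i] += z[i] * hz[i] / num_samples
    return est

# Hypothetical symmetric "Hessian" with diagonal [4, 3, 2].
H = [[4.0, 0.5, 0.0],
     [0.5, 3.0, 0.5],
     [0.0, 0.5, 2.0]]
diag_est = hutchinson_diag(lambda v: matvec(H, v), dim=3)
```

The estimate is noisy because the off-diagonal entries contribute zero-mean terms, which is one reason the abstract mentions smoothing and averaging on top of the raw estimate.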

2. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training #

Year: 2023
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
ArXiv ID: arXiv:2305.14342
Algorithm: Sophia
URL: https://arxiv.org/abs/2305.14342

Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.

Source Code: Known repository: https://github.com/Liuhong99/Sophia
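
The update rule described in the abstract (a moving average of gradients divided by a moving average of the estimated diagonal Hessian, then clipped element-wise) can be sketched as follows. This is an illustration under assumptions, not the reference implementation: the function name, hyperparameter names and default values are hypothetical, the exact diagonal of a toy quadratic stands in for the paper's Hessian estimator, and details such as weight decay are omitted.

```python
def sophia_like_step(theta, m, h, grad, hess_diag_est=None,
                     lr=0.1, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One clipped, diagonally preconditioned step (illustrative only)."""
    # Moving average of the gradients.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    # Moving average of the diagonal Hessian estimate, refreshed only when one is supplied.
    if hess_diag_est is not None:
        h = [beta2 * hi + (1.0 - beta2) * di for hi, di in zip(h, hess_diag_est)]
    # Ratio of the two averages, clipped element-wise to [-rho, rho].
    update = [max(-rho, min(rho, mi / max(hi, eps))) for mi, hi in zip(m, h)]
    theta = [ti - lr * ui for ti, ui in zip(theta, update)]
    return theta, m, h

# Toy quadratic f(theta) = 0.5 * (100 * theta_0^2 + theta_1^2), so diag(H) = [100, 1].
curv = [100.0, 1.0]
theta, m, h = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for step in range(20):
    grad = [c * t for c, t in zip(curv, theta)]
    # Refresh the Hessian estimate every 5 steps; reuse the stale average otherwise.
    est = curv if step % 5 == 0 else None
    new_theta, m, h = sophia_like_step(theta, m, h, grad, est)
    # Clipping bounds every coordinate's move by lr * rho = 0.1.
    assert all(abs(a - b) <= 0.1 + 1e-12 for a, b in zip(new_theta, theta))
    theta = new_theta
```

The in-loop assertion shows the role of clipping: no matter how small the Hessian average is early on, a single step cannot move any coordinate by more than `lr * rho`.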

Hessian-free Optimization #

1. Learning Recurrent Neural Networks with Hessian-Free Optimization #

Year: 2011
Authors: James Martens, Ilya Sutskever
ArXiv ID:
URL: https://www.cs.toronto.edu/~jmartens/docs/RNN_HF.pdf

Abstract: In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, on three natural and highly complex real-world sequence datasets where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.

Source Code: No explicit source code information found

2. Training Neural Networks with Stochastic Hessian-Free Optimization #

Year: 2013
Authors: Ryan Kiros
ArXiv ID: arXiv:1301.3641
Algorithm: SHF
URL: https://arxiv.org/abs/1301.3641

Abstract: Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens’ HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.

Source Code: Mentions ‘code’ in abstract
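
The core of Hessian-free optimization, as the abstract notes, is the conjugate gradient algorithm driven by curvature-vector products. A minimal sketch, assuming a symmetric positive definite curvature matrix and a hypothetical 2x2 example; the damping, mini-batching and dropout discussed in these papers are not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(hvp_fn, b, max_iters=50, tol=1e-10):
    """Approximately solve H x = b using only products v -> H v (H assumed symmetric positive definite)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - H x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old ** 0.5 < tol:
            break
        hp = hvp_fn(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Hypothetical curvature matrix and gradient; the update direction d solves H d = -g.
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
d = conjugate_gradient(lambda v: [dot(row, v) for row in H], [-gi for gi in g])
```

Because only `hvp_fn` is needed, the matrix `H` never has to be stored in a real application; here it is explicit purely so the result can be checked by hand against d = [-1/11, -7/11].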

Quasi-Newton #

1. A Stochastic Quasi-Newton Method for Large-Scale Optimization #

Year: 2014
Authors: R.H. Byrd, S.L. Hansen, J. Nocedal, Y. Singer
ArXiv ID: arXiv:1401.7020
URL: https://arxiv.org/abs/1401.7020

Abstract: The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.

Source Code: No explicit source code information found
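
The limited-memory BFGS form referred to in the abstract is usually applied through the standard two-loop recursion, which turns stored (s, y) pairs into an approximate inverse-Hessian-vector product. The sketch below shows that textbook recursion only; the function name and the toy pairs are hypothetical, and the pairs are built as y = A s for a known matrix A to mimic curvature collected through Hessian-vector products. The sub-sampling and update schedule of the paper are not reproduced.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: returns r, an approximation of H^{-1} grad, from (s, y) pairs given oldest first."""
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in reversed(pairs):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial inverse-Hessian guess: a scaled identity built from the newest pair.
    s_last, y_last = pairs[-1]
    gamma = dot(s_last, y_last) / dot(y_last, y_last)
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / dot(y, s)
        b = rho * dot(y, r)
        r = [ri + si * (a - b) for ri, si in zip(r, s)]
    return r

# Toy curvature A = diag(2, 5); each y equals A s.
pairs = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 5.0])]
r = lbfgs_direction([2.0, 5.0], pairs)   # the quasi-Newton step direction would be -r
```

With these two pairs the recursion reproduces the exact inverse of A, so `r` equals [1, 1]. In general only the secant condition for the newest pair is guaranteed, meaning that applying the recursion to the newest y returns the newest s.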

2. A Multi-Batch L-BFGS Method for Machine Learning #

Year: 2016
Authors: Albert S. Berahas, Jorge Nocedal, Martin Takáč
ArXiv ID: arXiv:1605.06049
URL: https://arxiv.org/abs/1605.06049

Abstract: The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

Source Code: No explicit source code information found

3. Stochastic Quasi-Newton with Line-Search Regularization #

Year: 2019
Authors: Adrian Wills, Thomas Schön
ArXiv ID: arXiv:1909.01238
Algorithm: SQN
URL: https://arxiv.org/abs/1909.01238

Abstract: In this paper we present a novel quasi-Newton algorithm for use in stochastic optimisation. Quasi-Newton methods have had an enormous impact on deterministic optimisation problems because they afford rapid convergence and computationally attractive algorithms. In essence, this is achieved by learning the second-order (Hessian) information based on observing first-order gradients. We extend these ideas to the stochastic setting by employing a highly flexible model for the Hessian and infer its value based on observing noisy gradients. In addition, we propose a stochastic counterpart to standard line-search procedures and demonstrate the utility of this combination on maximum likelihood identification for general nonlinear state space models.

Source Code: No explicit source code information found

4. Practical Quasi-Newton Methods for Training Deep Neural Networks #

Year: 2020
Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou
ArXiv ID: arXiv:2006.08877
URL: https://arxiv.org/abs/2006.08877

Abstract: We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.

Source Code: Mentions ‘code’ in abstract; Mentions ‘implementation’ in abstract

Gauss-Newton #

1. Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks #

Year: 2019
Authors: Yi Ren, Donald Goldfarb
ArXiv ID: arXiv:1906.02353
Algorithm: SWM-GN, SWM-NG
URL: https://arxiv.org/abs/1906.02353

Abstract: We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for solving non-convex optimization problems that arise in training deep neural networks involving enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices and either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic), which, in the latter case, we prove convergent to a stationary point. By using the Sherman-Morrison-Woodbury formula with automatic differentiation (backpropagation) we show how our methods can be implemented to perform efficiently. Finally, numerical results are presented to demonstrate the effectiveness of our proposed methods.

Source Code: No explicit source code information found

2. On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs #

Year: 2020
Authors: Matilde Gargiani, et al.
ArXiv ID: arXiv:2006.02409
Algorithm: SGN
URL: https://arxiv.org/abs/2006.02409

Abstract: Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to compute an approximate search direction, relies on the conjugate gradient method combined with forward and reverse automatic differentiation. Despite the success of SGD and its first-order variants, and despite Hessian-free methods based on the Gauss-Newton Hessian approximation having been already theoretically proposed as practical methods for training DNNs, we believe that SGN has a lot of undiscovered and yet not fully displayed potential in big mini-batch scenarios. For this setting, we demonstrate that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime. This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform, which, unlike Tensorflow and Pytorch, supports forward automatic differentiation. This enables researchers to further study and improve this promising optimization technique and hopefully reconsider stochastic second-order methods as competitive optimization techniques for training DNNs; we also hope that the promise of SGN may lead to forward automatic differentiation being added to Tensorflow or Pytorch. Our results also show that in big mini-batch scenarios SGN is more robust than SGD with respect to its hyperparameters (we never had to tune its step-size for our benchmarks!), which eases the expensive process of hyperparameter tuning that is instead crucial for the performance of first-order methods.

Source Code: Mentions ‘implementation’ in abstract

3. Stochastic Gauss-Newton Algorithms for Nonconvex Compositional Optimization #

Year: 2020
Authors: Quoc Tran-Dinh, et al.
ArXiv ID: arXiv:2002.07290
Algorithm: SGN with SARAH estimators
URL: https://arxiv.org/abs/2002.07290

Abstract: We develop two new stochastic Gauss-Newton algorithms for solving a class of non-convex stochastic compositional optimization problems frequently arising in practice. We consider both the expectation and finite-sum settings under standard assumptions, and use both classical stochastic and SARAH estimators for approximating function values and Jacobians. In the expectation case, we establish $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity to achieve a stationary point in expectation and estimate the total number of stochastic oracle calls for both function value and its Jacobian, where $\varepsilon$ is a desired accuracy. In the finite sum case, we also estimate $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity and the total oracle calls with high probability. To our best knowledge, this is the first time such global stochastic oracle complexity is established for stochastic Gauss-Newton methods. Finally, we illustrate our theoretical results via two numerical examples on both synthetic and real datasets.

Source Code: No explicit source code information found

4. Nonlinear Least Squares for Large-Scale Machine Learning using Stochastic Jacobian Estimates #

Year: 2021
Authors: Johannes J. Brust
ArXiv ID: arXiv:2107.05598
Algorithm: NLLS1, NLLSL
URL: https://arxiv.org/abs/2107.05598

Abstract: For large nonlinear least squares loss functions in machine learning we exploit the property that the number of model parameters typically exceeds the data in one batch. This implies a low-rank structure in the Hessian of the loss, which enables effective means to compute search directions. Using this property, we develop two algorithms that estimate Jacobian matrices and perform well when compared to state-of-the-art methods.

Source Code: No explicit source code information found

5. Improving Levenberg-Marquardt Algorithm for Neural Networks #

Year: 2022
Authors: Omead Pooladzandi, Yiming Zhou
ArXiv ID: arXiv:2212.08769
Algorithm: LM
URL: https://arxiv.org/abs/2212.08769

Abstract: We explore the usage of the Levenberg-Marquardt (LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance.

Source Code: No explicit source code information found

6. Rethinking Gauss-Newton for learning over-parameterized models #

Year: 2023
Authors: Michael Arbel, et al.
ArXiv ID: arXiv:2302.02904
URL: https://arxiv.org/abs/2302.02904

Abstract: This work studies the global convergence and implicit bias of Gauss Newton’s (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN’s method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.

Source Code: No explicit source code information found

7. Exact Gauss-Newton Optimization for Training Deep Neural Networks #

Year: 2024
Authors: Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon
ArXiv ID: arXiv:2405.14402
Algorithm: EGN
URL: https://arxiv.org/abs/2405.14402

Abstract: We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.

Source Code: No explicit source code information found
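
To see why factorizing a matrix of mini-batch size can suffice, it may help to write out the matrix identity in a simplified least-squares form. The symbols below are assumptions for illustration rather than the paper's notation: $J \in \mathbb{R}^{b \times n}$ is a mini-batch Jacobian with batch size $b$ and $n$ parameters, $r \in \mathbb{R}^{b}$ is a residual vector, and $\lambda > 0$ is a regularization constant. The loss-Hessian factor of the generalized Gauss-Newton matrix is left out.

```latex
% Multiply out both sides to verify the first line, then invert the two bracketed matrices.
J^\top \left( J J^\top + \lambda I_b \right) = \left( J^\top J + \lambda I_n \right) J^\top
\quad\Longrightarrow\quad
\left( J^\top J + \lambda I_n \right)^{-1} J^\top r = J^\top \left( J J^\top + \lambda I_b \right)^{-1} r
```

The left-hand side of the conclusion requires solving an $n \times n$ system, while the right-hand side only requires a $b \times b$ one, which is the cheaper route whenever $b \ll n$.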

Fisher Information #

1. Optimizing Neural Networks with Kronecker-factored Approximate Curvature #

Year: 2015
Authors: James Martens, Roger Grosse
ArXiv ID: arXiv:1503.05671
Algorithm: K-FAC
URL: https://arxiv.org/abs/1503.05671

Abstract: We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network’s Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC’s approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

Source Code: Various implementations available

Other #

1. Second-order optimization with lazy Hessians #

Year: 2022
Authors: Nikita Doikov, El Mahdi Chayti, Martin Jaggi
ArXiv ID: arXiv:2212.00781
URL: https://arxiv.org/abs/2212.00781

Abstract: We analyze Newton’s method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every $d$ iterations, where $d$ is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.

Source Code: No explicit source code information found

Machine Learning & Combinatorial Optimization

Sat, 08 Apr 2023 00:00:00 +0000

A comprehensive overview of machine learning approaches and techniques applied to combinatorial optimization problems, covering foundational concepts, methodologies, and state-of-the-art advances.

Scope: Systematic review of learning-based CO solving methods including supervised learning for heuristics, reinforcement learning for search policies, and hybrid approaches combining classical and neural methods.

Graph Matching #

The problem of finding correspondences between vertices in two graphs, with applications in pattern recognition, shape analysis, and image matching. Deep learning methods have enabled scalable solutions for large graphs.

Definition: Given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, find a correspondence $\pi: V_1 \to V_2$ that maximizes structural similarity, typically measured by the number of preserved edge relationships or minimizing matching cost.

Quadratic Assignment Problem #

An NP-hard optimization problem that assigns n facilities to n locations to minimize total cost, where costs depend on pairwise assignments. Classical applications include facility layout and keyboard design.

Formulation: Minimize $\sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} d_{\pi(i)\pi(j)}$ where $f_{ij}$ is the flow between facilities $i$ and $j$, $d_{kl}$ is the distance between locations $k$ and $l$, and $\pi$ is a permutation that assigns each facility to exactly one location.
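To make the pairwise cost structure concrete, here is a minimal sketch that assumes the standard Koopmans-Beckmann form of the problem, with a flow matrix between facilities and a distance matrix between locations. The names `qap_cost`, `qap_brute_force`, `flow`, and `dist` are illustrative, and the exhaustive search is only practical for very small $n$.

```python
from itertools import permutations

def qap_cost(flow, dist, perm):
    """Cost of placing facility i at location perm[i] (assumed Koopmans-Beckmann form)."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

def qap_brute_force(flow, dist):
    """Enumerate all n! assignments and keep the cheapest one."""
    n = len(flow)
    return min(permutations(range(n)), key=lambda p: qap_cost(flow, dist, p))

# Hypothetical toy instance: facilities 0 and 1 exchange the most flow.
flow = [[0, 5, 1],
        [5, 0, 2],
        [1, 2, 0]]
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]

best = qap_brute_force(flow, dist)
print(best, qap_cost(flow, dist, best))  # (0, 1, 2) 26
```

The heavy-flow pair ends up on the two closest locations, which is the intuition behind most QAP heuristics.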

Travelling Salesman Problem #

One of the most studied combinatorial optimization problems, seeking the shortest route visiting all cities exactly once. Neural approaches and learning-based heuristics have shown competitive performance compared to traditional methods.

Formulation: Minimize $\sum_{i=1}^{n} d(c_{\pi(i)}, c_{\pi((i \bmod n) + 1)})$ where $\pi$ is a permutation of $n$ cities and $d$ is the distance function, so that each city is visited exactly once and the tour returns to its starting city.
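As a rough illustration of the objective, the sketch below evaluates the length of a closed tour and builds one with the nearest-neighbor heuristic. It assumes Euclidean distances and 0-indexed cities; `tour_length` and `nearest_neighbor_tour` are illustrative names, and the heuristic is a simple classical baseline rather than a learned method.

```python
import math

def tour_length(cities, perm):
    """Length of the closed tour that visits cities in the order given by perm."""
    n = len(perm)
    return sum(math.dist(cities[perm[i]], cities[perm[(i + 1) % n]])
               for i in range(n))

def nearest_neighbor_tour(cities, start=0):
    """Greedy baseline: always move to the closest unvisited city (ties by index)."""
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        last = cities[tour[-1]]
        nxt = min(unvisited, key=lambda c: (math.dist(last, cities[c]), c))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

# Hypothetical instance: the four corners of a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
tour = nearest_neighbor_tour(cities)
print(tour, tour_length(cities, tour))  # [0, 1, 2, 3] 4.0
```

The modulo in `tour_length` closes the tour by connecting the last city back to the first.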

Portfolio Optimization #

Financial optimization for asset allocation, determining optimal portfolio composition to maximize returns while managing risk and satisfying investment constraints.

Formulation: Maximize $\mathbf{w}^T \boldsymbol{\mu} - \lambda \mathbf{w}^T \Sigma \mathbf{w}$ subject to $\sum w_i = 1$ and $w_i \geq 0$, where $\mathbf{w}$ are weights, $\boldsymbol{\mu}$ expected returns, $\Sigma$ covariance matrix, and $\lambda$ risk aversion.

Maximal Cut #

The problem of partitioning graph vertices into two sets to maximize edges between partitions. A fundamental graph problem with applications in circuit design and network optimization.

Formulation: Partition vertices $V$ into disjoint sets $S$ and $\bar{S}$ to maximize $|\{(u,v) \in E : u \in S, v \in \bar{S}\}|$, or equivalently maximize $\sum_{(u,v) \in E} [x_u(1-x_v) + x_v(1-x_u)]$ where $x_i \in \{0,1\}$ indicates whether vertex $i$ is in $S$.
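A small sketch of the binary formulation, assuming an undirected graph given as an edge list. Each edge is counted in both orientations so that it contributes 1 whenever its endpoints fall on opposite sides. The function names are illustrative, and the enumeration over all $2^n$ assignments is only meant for tiny graphs.

```python
from itertools import product

def cut_value(edges, x):
    """Number of edges whose endpoints receive different labels in x."""
    return sum(x[u] * (1 - x[v]) + x[v] * (1 - x[u]) for u, v in edges)

def max_cut_brute_force(n, edges):
    """Try every 0/1 labelling of the n vertices."""
    return max(product((0, 1), repeat=n), key=lambda x: cut_value(edges, x))

# Hypothetical instance: a 4-cycle with one chord (0, 2).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
best = max_cut_brute_force(4, edges)
print(best, cut_value(edges, best))  # (0, 1, 0, 1) 4
```

The chord creates a triangle, so not every edge can be cut and the optimum is 4 rather than 5.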

Vehicle Routing Problem #

Optimizing routes for a fleet of vehicles to serve customers with minimum distance/cost. Extensions include time windows, capacity constraints, and multiple depots, common in logistics and delivery services.

Formulation: Minimize $\sum_{k=1}^{K} \sum_{i,j} c_{ij} x_{ijk}$ subject to each customer visited by exactly one vehicle, vehicle capacity constraints $\sum_{i \in R_k} d_i \leq C_k$, and flow conservation constraints where $x_{ijk}$ indicates if vehicle $k$ travels from $i$ to $j$.

Job Shop Scheduling Problem #

Scheduling jobs on machines to minimize completion time while respecting precedence and machine constraints. A fundamental problem in manufacturing and production planning.

Formulation: Minimize makespan $C_{max}$ subject to: each job $j$ consists of operations that must be processed in order on specified machines, each machine can process at most one operation at a time, and operation durations are fixed.

Maximum Independent Set #

Finding the largest set of vertices with no edges between them in a graph. An NP-hard problem with applications in scheduling, coding theory, and network design.

Formulation: Maximize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \leq 1$ for all $(i,j) \in E$ and $x_i \in \{0,1\}$, where $x_i = 1$ if vertex $i$ is in the set.
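The edge constraint can be checked directly. The following sketch, with illustrative names and 0-indexed vertices, searches subsets from largest to smallest and returns the first one in which no edge has both endpoints selected. It is exponential in $n$ and intended only to mirror the formulation.

```python
from itertools import combinations

def is_independent(edges, subset):
    """True if no edge has both endpoints in subset."""
    chosen = set(subset)
    return not any(u in chosen and v in chosen for u, v in edges)

def max_independent_set(n, edges):
    """Exhaustive search, largest candidate sizes first."""
    for size in range(n, -1, -1):
        for subset in combinations(range(n), size):
            if is_independent(edges, subset):
                return subset
    return ()

# Hypothetical instance: a path on five vertices.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(max_independent_set(5, path_edges))  # (0, 2, 4)
```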

Generalization #

Studying how machine learning solvers generalize across different problem instances and scales, and developing methods that handle adversarial or out-of-distribution scenarios.

Definition: Train model $\theta$ on distribution $D_{train}$ minimizing $\mathbb{E}_{\mathbf{x} \sim D_{train}}[\ell(f_\theta(\mathbf{x}), y^*)]$ such that test error $\mathbb{E}_{\mathbf{x} \sim D_{test}}[\ell(f_\theta(\mathbf{x}), y^*)]$ remains small for $D_{test}$ different from $D_{train}$ (different sizes, perturbations).

Orienteering Problem #

A variant of the traveling salesman problem where a subset of vertices must be selected to maximize profit while respecting a distance constraint. Applications include tourist route planning and project selection.

Formulation: Maximize $\sum_{i \in S} p_i$ subject to the total travel distance $\sum_{(i,j) \in P} d_{ij} \leq L$, where $S \subseteq V$ is the set of selected vertices, $P$ is the route that visits them, $p_i$ are profits, and $L$ is the distance limit.

Knapsack #

The problem of selecting items with given weights and values to maximize total value within a weight capacity. A fundamental dynamic programming problem with numerous variants (0/1, bounded, unbounded).

Formulation: Maximize $\sum_{i=1}^{n} v_i x_i$ subject to $\sum_{i=1}^{n} w_i x_i \leq W$ and $x_i \in \{0,1\}$, where $v_i$ are values, $w_i$ are weights, and $W$ is capacity.
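For the 0/1 variant, the classical dynamic program can be sketched as follows, assuming non-negative integer weights and capacity. The function name `knapsack_01` and the sample data are illustrative.

```python
def knapsack_01(values, weights, capacity):
    """Best total value for the 0/1 knapsack with integer weights."""
    best = [0] * (capacity + 1)  # best[c] = best value achievable with capacity c
    for v, w in zip(values, weights):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack_01([60, 100, 120], [10, 20, 30], 50))  # 220
```

The running time is $O(nW)$, which is pseudo-polynomial: it depends on the numeric value of $W$, not only on the number of items.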

Computing Resource Allocation #

Optimal allocation of computational resources (CPU, memory, bandwidth) across tasks or virtual machines to maximize utilization while meeting performance requirements.

Formulation: Maximize $\sum_{t=1}^{T} u_t$ subject to $\sum_{t \in task_i} r_{t,d} \leq R_{i,d}$ for each device $d$, latency constraints $L_t \leq L_{max,t}$, where $u_t$ is utility and $r_{t,d}$ is resource $d$ for task $t$.

Bin Packing Problem #

Packing items of varying sizes into a minimum number of bins, a classic problem in logistics and resource management. Variants include 2D and 3D packing with practical applications in shipping and manufacturing.

Formulation: Minimize $\sum_{b=1}^{B} y_b$ subject to $\sum_{i \in b} s_i \leq C \cdot y_b$ for each bin $b$, where $s_i$ is item size, $C$ is bin capacity, and $y_b \in \{0,1\}$ indicates whether bin $b$ is used.
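A common classical baseline for the one-dimensional case is first-fit decreasing, sketched below. It assumes every item fits into an empty bin (`s_i <= C`), and the function name is illustrative. This is a heuristic and does not guarantee the minimum number of bins in general.

```python
def first_fit_decreasing(sizes, capacity):
    """Place items, largest first, into the first bin with enough room."""
    bins = []
    for s in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            bins.append([s])  # no existing bin had room, so open a new one
    return bins

packing = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
print(packing)  # [[8, 2], [4, 4, 1, 1]]
```

Here the item sizes sum to 20, so two bins of capacity 10 are also a lower bound and the heuristic happens to be optimal on this instance.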

Graph Edit Distance #

Measuring the dissimilarity between two graphs as the minimum cost of edit operations (insertions, deletions, substitutions) needed to transform one into another. Used in pattern recognition and molecule comparison.

Definition: $GED(G_1, G_2) = \min_{\xi} \sum_{op \in \xi} cost(op)$ where $\xi$ is an edit path transforming $G_1$ to $G_2$, and cost is the sum of operation costs (vertex/edge insertion, deletion, substitution).

Hamiltonian Cycle Problem #

Finding a cycle that visits every vertex exactly once in an undirected graph. A fundamental NP-complete problem related to the traveling salesman problem.

Definition: Determine if there exists a cycle in graph $G = (V, E)$ that visits every vertex in $V$ exactly once. Decision problem: is a Hamiltonian cycle present?
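The decision problem can be illustrated with plain backtracking, which extends a path one vertex at a time and checks at the end whether it closes into a cycle. This is a sketch with illustrative names, assuming an undirected simple graph with vertices `0..n-1`. Its worst-case running time is exponential.

```python
def hamiltonian_cycle(n, edges):
    """Return a Hamiltonian cycle as a vertex list starting at 0, or None."""
    if n < 3:
        return None  # a simple cycle needs at least three vertices
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    path, visited = [0], [False] * n
    visited[0] = True

    def extend():
        if len(path) == n:
            return path[0] in adj[path[-1]]  # can the path be closed?
        for v in sorted(adj[path[-1]]):
            if not visited[v]:
                visited[v] = True
                path.append(v)
                if extend():
                    return True
                path.pop()  # undo the choice and try the next neighbor
                visited[v] = False
        return False

    return path if extend() else None

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
star = [(0, 1), (0, 2), (0, 3)]
print(hamiltonian_cycle(4, square))  # [0, 1, 2, 3]
print(hamiltonian_cycle(4, star))    # None
```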

Graph Coloring #

Assigning colors to vertices such that no adjacent vertices share the same color, using the minimum number of colors. Applications include scheduling, register allocation, and map coloring.

Formulation: Minimize $k$ such that $c: V \to \{1, \dots, k\}$ where $c(u) \neq c(v)$ for all $(u,v) \in E$, i.e., find the chromatic number $\chi(G)$.
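Greedy coloring gives a quick upper bound on $\chi(G)$. The sketch below uses illustrative names and assumes vertices `0..n-1` processed in index order. It assigns each vertex the smallest color not used by an already colored neighbor. The result depends on the vertex order and is not guaranteed to be optimal.

```python
def greedy_coloring(n, edges):
    """Map each vertex to a color in {1, 2, ...} so that neighbors differ."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n):
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# Hypothetical instance: an odd cycle on five vertices, which needs 3 colors.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
coloring = greedy_coloring(5, cycle5)
print(coloring, max(coloring.values()))  # {0: 1, 1: 2, 2: 1, 3: 2, 4: 3} 3
```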

Maximal Common Subgraph #

Finding the largest subgraph isomorphic to both input graphs, useful in molecular structure comparison and pattern discovery applications.

Definition: Find subgraph $G_{mcs} = (V_{mcs}, E_{mcs})$ that is isomorphic to subgraphs of both $G_1$ and $G_2$, maximizing $|V_{mcs}|$ (or $|E_{mcs}|$).

Influence Maximization #

Selecting a subset of nodes in a social network to maximize the spread of information or influence through the network. A key problem in viral marketing and network analysis.

Formulation: Select subset $S \subseteq V$ with $|S| \leq k$ to maximize expected spread $f(S)$, where $f(S) = E[|T(S)|]$ is the expected number of influenced nodes given initial set $S$.

Boolean Satisfiability #

Determining if a boolean formula can be satisfied, one of the most studied NP-complete problems. Recent neural approaches have shown promise for both solving and reasoning about SAT instances.

Definition: Given boolean formula $\phi$ in conjunctive normal form (CNF) with $m$ clauses over $n$ variables, determine if there exists an assignment $\mathbf{x} \in \{0,1\}^n$ such that $\phi(\mathbf{x}) = \text{true}$.
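To illustrate the definition, the sketch below encodes a CNF formula as a list of clauses, where a positive integer `k` stands for variable $x_k$ and `-k` for its negation. This signed-integer encoding is an assumption of the sketch, borrowed from the DIMACS convention. The search tries all $2^n$ assignments and is only meant for small formulas.

```python
from itertools import product

def satisfies(clauses, assignment):
    """True if every clause contains at least one literal made true by assignment."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def brute_force_sat(clauses, n):
    """Return a satisfying assignment as a tuple of booleans, or None."""
    for bits in product((False, True), repeat=n):
        if satisfies(clauses, bits):
            return bits
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
phi = [[1, -2], [2, 3], [-1, -3]]
print(brute_force_sat(phi, 3))         # (False, False, True)
print(brute_force_sat([[1], [-1]], 1)) # None
```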

Max Clique #

Finding the largest clique (complete subgraph) in an undirected graph. An NP-hard problem with applications in social network analysis and bioinformatics.

Formulation: Maximize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \leq 1 + \mathbb{1}_{(i,j) \in E}$ for all $i < j$ and $x_i \in \{0,1\}$, so that any two selected vertices must be adjacent and the selected set is the largest complete subgraph.

Mixed Integer Programming #

Optimizing linear objective functions subject to linear constraints where some variables must be integers. A general framework encompassing many CO problems, widely used in operations research.

Formulation: Minimize $\mathbf{c}^T \mathbf{x}$ subject to $A\mathbf{x} \leq \mathbf{b}$, $\mathbf{x} \geq \mathbf{0}$, and $x_i \in \mathbb{Z}$ for $i \in I$, where $I$ indicates integer-constrained variables.

Causal Discovery #

Learning the underlying causal structure from observational data, identifying causal relationships between variables. Important for understanding complex systems in medicine, economics, and science.

Definition: Learn directed acyclic graph (DAG) $G = (V, E)$ from observational data where edge $(i \to j) \in E$ indicates $i$ causally influences $j$. Goal: identify true DAG $G^*$ minimizing score $S(G | \mathbf{D})$ subject to acyclicity constraint.

Game Theoretic Semantics #

A game-based interpretation of logical formulas where truth is determined by winning strategies in semantic games, providing computational game-theoretic perspectives on logic and reasoning.

Definition: For formula $\phi$ in language $L$, define semantic game where two players (verifier and falsifier) move according to formula structure. Formula is true in structure if verifier has winning strategy.

Differentiable Optimization #

Making optimization layers differentiable so they can be embedded in neural networks, enabling end-to-end learning where optimization problems become trainable components of deep models.

Formulation: Given parametric optimization problem $y^* = \arg\min_y f(y; \theta)$, compute the implicit gradient $\frac{\partial y^*}{\partial \theta}$ using implicit differentiation: $\nabla_\theta y^* = -[\nabla_y^2 f]^{-1} \nabla_{\theta,y}^2 f$, enabling backpropagation through the optimizer.
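The implicit-gradient formula can be checked on a scalar toy problem. The sketch below assumes the hypothetical objective $f(y; \theta) = y^4/4 + y^2/2 - \theta y$, chosen only because it is strictly convex in $y$. Its inner problem is solved with Newton's method, and the implicit gradient is compared against a central finite difference.

```python
def solve_inner(theta, iters=50):
    """Newton's method on grad_y f = y**3 + y - theta (toy objective, assumed)."""
    y = 0.0
    for _ in range(iters):
        grad = y**3 + y - theta
        hess = 3 * y**2 + 1  # always positive, so f is strictly convex in y
        y -= grad / hess
    return y

def implicit_gradient(theta):
    """dy*/dtheta = -[d2f/dy2]^{-1} * d2f/(dtheta dy), evaluated at y*."""
    y_star = solve_inner(theta)
    hess_yy = 3 * y_star**2 + 1
    cross = -1.0  # derivative of grad_y f with respect to theta
    return -cross / hess_yy

theta, h = 2.0, 1e-5
finite_diff = (solve_inner(theta + h) - solve_inner(theta - h)) / (2 * h)
print(solve_inner(theta), implicit_gradient(theta), finite_diff)
```

At $\theta = 2$ the minimizer is $y^* = 1$, so the implicit gradient is $1/(3 \cdot 1^2 + 1) = 0.25$. No derivative is taken through the Newton iterations themselves; only second derivatives at the solution are needed.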

Car Dispatch #

Optimally assigning vehicles to passenger requests in ride-hailing and autonomous driving systems, minimizing empty miles and response times.

Formulation: Assign requests $R$ to vehicles $V$ minimizing $\sum_{r \in R} (\alpha \cdot ETA_r + \beta \cdot det_{r})$ subject to vehicle capacity $|A_v| \leq C_v$, time window constraints on pickups/dropoffs, and driver constraints.

Conjunctive Query Containment #

A fundamental problem in database theory and reasoning, determining whether one query result is guaranteed to be a subset of another query’s result.

Definition: Given conjunctive queries $Q_1, Q_2$ over schema, determine if $\text{ans}(Q_1, I) \subseteq \text{ans}(Q_2, I)$ for all possible database instances $I$. Equivalently, check if there exists homomorphism from $Q_2$ to $Q_1$.

Virtual Network Embedding #

Mapping virtual network components (nodes and links) onto physical infrastructure, optimizing resource utilization and quality of service in cloud computing and network management.

Formulation: Map virtual network $G_v = (V_v, E_v)$ to substrate network $G_s = (V_s, E_s)$ by finding embedding $e_n: V_v \to V_s$ and $e_l: E_v \to P(E_s)$ minimizing resource usage while ensuring capacity constraints.

Predict+Optimize #

Decision-focused learning that integrates prediction and optimization into a unified framework, optimizing predictions for decision quality rather than traditional accuracy metrics.

Formulation: Train predictor $f_\theta$ to minimize task loss $\mathcal{L}(y^*(f_\theta(\mathbf{x})), y^*_{opt}) = \mathcal{L}(\arg\min_y f(y; f_\theta(\mathbf{x})), y^*_{opt})$, where $y^*_{opt}$ is the optimal decision under the true parameters, using implicit differentiation through the optimization layer.

Optimal Power Flow #

Determining optimal setpoints for generators in power systems to supply electricity while minimizing costs and satisfying physical constraints, fundamental for smart grid management.

Formulation: Minimize $\sum_{g=1}^{G} (a_g + b_g P_g + c_g P_g^2)$ subject to power balance $P_i = \sum_g P_g - L_i$, voltage constraints $|V_i| \in [V_{min}, V_{max}]$, and transmission limits.

Facility Location Problem #

Determining optimal locations for facilities (warehouses, hospitals, schools) to serve customers, minimizing total distance and facility opening costs.

Formulation: Minimize $\sum_{j=1}^{m} f_j y_j + \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij}$ subject to $\sum_j x_{ij} = 1$ (serve all customers), $x_{ij} \leq y_j$ (assignment constraints), and $y_j \in \{0,1\}$.
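For the uncapacitated case, once the set of open facilities is fixed, each customer is simply served by its cheapest open facility. The sketch below uses that observation to evaluate the objective and enumerate all non-empty open sets. Names and data are illustrative, and the enumeration is exponential in the number of facilities.

```python
from itertools import combinations

def facility_location_cost(open_set, fixed_costs, assign_costs):
    """Opening costs plus each customer's cheapest assignment to an open facility."""
    opening = sum(fixed_costs[j] for j in open_set)
    service = sum(min(row[j] for j in open_set) for row in assign_costs)
    return opening + service

def solve_brute_force(fixed_costs, assign_costs):
    """Enumerate every non-empty subset of facilities."""
    m = len(fixed_costs)
    candidates = (s for k in range(1, m + 1) for s in combinations(range(m), k))
    return min(candidates,
               key=lambda s: facility_location_cost(s, fixed_costs, assign_costs))

# Hypothetical instance: 2 facilities, 3 customers; assign_costs[i][j] = c_ij.
fixed_costs = [3, 4]
assign_costs = [[1, 5],
                [2, 4],
                [6, 1]]
best_open = solve_brute_force(fixed_costs, assign_costs)
print(best_open, facility_location_cost(best_open, fixed_costs, assign_costs))  # (0, 1) 11
```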

Sorting & Ranking #

Differentiable sorting and ranking operations that can be integrated into neural networks, enabling permutation-based learning and differentiable ranking optimization.

Definition: Approximate permutation matrix $P \in \mathbb{R}^{n \times n}$ where $P\mathbf{x}$ sorts vector $\mathbf{x}$ in differentiable manner, or compute ranking scores $r_i$ for items proportional to quality or preference.
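One simple way to obtain differentiable ranking scores is to replace each pairwise comparison with a sigmoid. The sketch below is an illustrative relaxation, not a specific published method. The temperature `tau` and the function name `soft_rank` are assumptions. It computes $r_i = 1 + \sum_{j \neq i} \sigma((x_j - x_i)/\tau)$, so the largest entry receives a rank close to 1.

```python
import math

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_rank(x, tau=1.0):
    """Smooth descending ranks; approaches hard ranks as tau goes to 0."""
    return [
        1.0 + sum(_sigmoid((xj - xi) / tau) for j, xj in enumerate(x) if j != i)
        for i, xi in enumerate(x)
    ]

scores = [3.0, 1.0, 2.0]
print(soft_rank(scores, tau=0.01))  # close to [1, 3, 2]
print(soft_rank(scores, tau=10.0))  # all ranks pulled towards the mean rank 2
```

Because $\sigma(z) + \sigma(-z) = 1$, the soft ranks always sum to $n(n+1)/2$, as hard ranks do.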

Combinatorial Drug Recommendation #

Finding optimal combinations of drugs to maximize therapeutic efficacy while minimizing adverse interactions, a key application in personalized medicine and drug discovery.

Formulation: Select drug subset $S \subseteq D$ to maximize efficacy $f(S)$ subject to safety constraint (drug interactions) $g(S) \leq \epsilon$ and cardinality limit $|S| \leq k$.

Stochastic Combinatorial Optimization #

Addressing CO problems with random or uncertain parameters, developing robust or adaptive solutions that perform well under uncertainty and variability.

Formulation: Minimize $\mathbb{E}[f(\mathbf{x}, \boldsymbol{\xi})]$ over decision $\mathbf{x} \in X$ where $\boldsymbol{\xi}$ is a random parameter vector, or find the robust solution $\mathbf{x}^* = \arg\min_\mathbf{x} \max_{\boldsymbol{\xi} \in U} f(\mathbf{x}, \boldsymbol{\xi})$.

Vertex Cover #

Finding the minimum set of vertices that covers all edges in a graph. A fundamental NP-hard problem with applications in network design and bioinformatics.

Formulation: Minimize $\sum_{i=1}^{n} x_i$ subject to $x_i + x_j \geq 1$ for all $(i,j) \in E$ and $x_i \in \{0,1\}$, where $x_i = 1$ if vertex $i$ is in the cover.
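A classical baseline here is the maximal-matching heuristic: scan the edges and, whenever an edge is still uncovered, add both of its endpoints. The sketch below uses an illustrative function name. The resulting cover is at most twice the optimal size, since any cover must contain at least one endpoint of each edge picked this way.

```python
def vertex_cover_2approx(edges):
    """Add both endpoints of every edge that is not yet covered."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Hypothetical instance: a path on four vertices; the optimum {1, 2} has size 2.
path_edges = [(0, 1), (1, 2), (2, 3)]
cover = vertex_cover_2approx(path_edges)
print(sorted(cover))  # [0, 1, 2, 3]
```

On this path the heuristic returns four vertices, which shows that the factor of 2 can be tight.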