Optimization on Nam Le

Optimization Papers in JMLR Volume 26

Sun, 29 Sep 2024 00:00:00 +0000

Optimization Research Papers in JMLR Volume 25

Sun, 29 Sep 2024 00:00:00 +0000

Optimization Research Papers in JMLR Volume 25 (2024) #

This document lists papers from JMLR Volume 25 (2024) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization #

Papers addressing convex optimization problems, including sparse NMF, differential privacy, and sparse regression.

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
Authors: Yuze Han, Guangzeng Xie, Zhihua Zhang
Description: Investigates lower complexity bounds for finite-sum optimization problems in convex settings.
Sparse NMF with Archetypal Regularization: Computational and Robustness Properties
Authors: Kayhan Behdin, Rahul Mazumder
Description: Proposes sparse non-negative matrix factorization with archetypal regularization using convex optimization.
Scaling the Convex Barrier with Sparse Dual Algorithms
Authors: Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar
Description: Develops sparse dual algorithms that scale tight convex relaxations used in neural network verification.
Faster Rates in Differentially Private Stochastic Convex Optimization
Authors: Jinyan Su, Lijie Hu, Di Wang
Description: Analyzes faster convergence rates for differentially private stochastic convex optimization.
Estimation of Sparse Gaussian Graphical Models with Hidden Clustering Structure
Authors: Meixia Lin, Defeng Sun, Kim-Chuan Toh, Chengjing Wang
Description: Develops convex optimization methods for sparse Gaussian graphical models with hidden clustering.
A Minimax Optimal Approach to High-Dimensional Double Sparse Linear Regression
Authors: Yanhang Zhang, Zhifan Li, Shixiang Liu, Jianxin Yin
Description: Proposes a minimax optimal approach for high-dimensional double sparse linear regression using convex optimization.
An Inexact Projected Regularized Newton Method for Fused Zero-Norms Regularization Problems
Authors: Yuqia Wu, Shaohua Pan, Xiaoqi Yang
Description: Introduces an inexact projected regularized Newton method for fused zero-norms regularization problems.

Nonconvex Optimization #

Papers tackling nonconvex optimization, focusing on ADMM, Adam-family methods, and stochastic minimax optimization.

Convergence for Nonconvex ADMM, with Applications to CT Imaging
Authors: Rina Foygel Barber, Emil Y. Sidky
Description: Studies convergence properties of nonconvex ADMM with applications to CT imaging.
Adam-Family Methods for Nonsmooth Optimization with Convergence Guarantees
Authors: Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh
Description: Develops Adam-family methods for nonsmooth nonconvex optimization with convergence guarantees.
Nonasymptotic Analysis of Stochastic Gradient Hamiltonian Monte Carlo under Local Conditions for Nonconvex Optimization
Authors: O. Deniz Akyildiz, Sotirios Sabanis
Description: Provides a nonasymptotic analysis of stochastic gradient Hamiltonian Monte Carlo for nonconvex optimization.
High Probability Convergence Bounds for Non-Convex Stochastic Gradient Descent with Sub-Weibull Noise
Authors: Liam Madden, Emiliano Dall’Anese, Stephen Becker
Description: Derives high-probability convergence bounds for nonconvex stochastic gradient descent with sub-Weibull noise.
Stochastic Regularized Majorization-Minimization with Weakly Convex and Multi-Convex Surrogates
Authors: Hanbaek Lyu
Description: Proposes stochastic regularized majorization-minimization for weakly convex and multi-convex problems.
Near-Optimal Algorithms for Stochastic Minimax Optimization
Authors: Lesi Chen, Luo Luo
Description: Develops near-optimal algorithms for stochastic minimax optimization in nonconvex settings.
Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks
Authors: Naoki Sato, Koshiro Izumi, Hideaki Iiduka
Description: Introduces a scaled conjugate gradient method for nonconvex optimization in deep neural networks.

Stochastic Optimization #

Papers focusing on stochastic optimization methods, including continuous-time approximations, momentum, and curvature estimates.

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent
Authors: Stefan Ankirchner, Stefan Perko
Description: Compares continuous-time approximations to stochastic gradient descent for optimization.
On the Generalization of Stochastic Gradient Descent with Momentum
Authors: Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang
Description: Analyzes the generalization properties of stochastic gradient descent with momentum.
Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent
Authors: Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi
Description: Studies stochastic modified flows and mean-field limits for stochastic gradient descent dynamics.
Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
Authors: Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy
Description: Investigates stochastic approximation with decision-dependent distributions, focusing on asymptotic normality and optimality.
An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization
Authors: Guy Kornowski, Ohad Shamir
Description: Proposes an algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.
On the Hyperparameters in Stochastic Gradient Descent with Momentum
Authors: Bin Shi
Description: Examines the impact of hyperparameters in stochastic gradient descent with momentum.
Almost Sure Convergence Rates Analysis and Saddle Avoidance of Stochastic Gradient Methods
Authors: Jun Liu, Ye Yuan
Description: Analyzes almost sure convergence rates and saddle avoidance in stochastic gradient methods.
PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates
Authors: Zachary Frangella, Pratik Rathore, Shipu Zhao, Madeleine Udell
Description: Introduces preconditioned stochastic optimization methods with scalable curvature estimates.
Zeroth-Order Stochastic Approximation Algorithms for DR-Submodular Optimization
Authors: Yuefang Lian, Xiao Wang, Dachuan Xu, Zhongrui Zhao
Description: Develops zeroth-order stochastic approximation algorithms for DR-submodular optimization.
Stochastic-Constrained Stochastic Optimization with Markovian Data
Authors: Yeongjong Kim, Dabeen Lee
Description: Studies stochastic-constrained optimization with Markovian data.
High Probability and Risk-Averse Guarantees for a Stochastic Accelerated Primal-Dual Method
Authors: Yassine Laguel, Necdet Serhat Aybat, Mert Gürbüzbalaban
Description: Provides high-probability and risk-averse guarantees for a stochastic accelerated primal-dual method.

Distributed/Decentralized Optimization #

Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and federated learning.

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms
Authors: T. Tony Cai, Hongji Wei
Description: Develops optimal rates and communication-efficient algorithms for distributed Gaussian mean estimation.
Accelerated Gradient Tracking over Time-Varying Graphs for Decentralized Optimization
Authors: Huan Li, Zhouchen Lin
Description: Proposes accelerated gradient tracking for decentralized optimization over time-varying graphs.
Compressed and Distributed Least-Squares Regression: Convergence Rates with Applications to Federated Learning
Authors: Constantin Philippenko, Aymeric Dieuleveut
Description: Analyzes convergence rates for compressed and distributed least-squares regression in federated learning.
Federated Automatic Differentiation
Authors: Keith Rush, Zachary Charles, Zachary Garrett
Description: Introduces federated automatic differentiation for distributed optimization.
A Random Projection Approach to Personalized Federated Learning: Enhancing Communication Efficiency, Robustness, and Fairness
Authors: Yuze Han, Xiang Li, Shiyun Lin, Zhihua Zhang
Description: Proposes a random projection approach to enhance communication efficiency in personalized federated learning.
Countering the Communication Bottleneck in Federated Learning: A Highly Efficient Zero-Order Optimization Technique
Authors: Elissa Mhanna, Mohamad Assaad
Description: Develops a zero-order optimization technique to address communication bottlenecks in federated learning.

Bandits and Online Learning #

Papers addressing multi-armed bandits, online optimization, and regret minimization.

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
Authors: Zixian Yang, Xin Liu, Lei Ying
Description: Studies exploration, exploitation, and engagement in multi-armed bandits with abandonment.
Adaptivity and Non-Stationarity: Problem-Dependent Dynamic Regret for Online Convex Optimization
Authors: Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou
Description: Analyzes problem-dependent dynamic regret for online convex optimization under non-stationarity.
Materials Discovery Using Max K-Armed Bandit
Authors: Nobuaki Kikkawa, Hiroshi Ohno
Description: Applies max k-armed bandit algorithms to materials discovery, focusing on regret minimization.
Finite-Time Analysis of Globally Nonstationary Multi-Armed Bandits
Authors: Junpei Komiyama, Edouard Fouché, Junya Honda
Description: Provides finite-time analysis for globally nonstationary multi-armed bandits.
Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization
Authors: Sijia Chen, Yu-Jie Zhang, Wei-Wei Tu, Peng Zhao, Lijun Zhang
Description: Develops optimistic online mirror descent for bridging stochastic and adversarial online convex optimization.
Continuous Prediction with Experts’ Advice
Authors: Nicholas J. A. Harvey, Christopher Liaw, Victor S. Portella
Description: Investigates continuous prediction with experts’ advice in online learning settings.
Regret Analysis of Bilateral Trade with a Smoothed Adversary
Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, Stefano Leonardi
Description: Analyzes regret in bilateral trade with a smoothed adversary in online optimization.
Optimal Learning Policies for Differential Privacy in Multi-Armed Bandits
Authors: Siwei Wang, Jun Zhu
Description: Develops optimal learning policies for differential privacy in multi-armed bandits.
Information Capacity Regret Bounds for Bandits with Mediator Feedback
Authors: Khaled Eldowa, Nicolò Cesa-Bianchi, Alberto Maria Metelli, Marcello Restelli
Description: Derives regret bounds for bandits with mediator feedback, focusing on information capacity.
Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression
Authors: Aleksandrs Slivkins, Xingyu Zhou, Karthik Abinav Sankararaman, Dylan J. Foster
Description: Proposes a modular Lagrangian approach for contextual bandits with packing and covering constraints.

Optimization in Reinforcement Learning #

Papers focusing on optimization techniques for reinforcement learning, including policy gradient, actor-critic, and safe RL.

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization
Authors: Shicong Cen, Yuting Wei, Yuejie Chi
Description: Develops fast policy extragradient methods for competitive games with entropy regularization in RL.
Sample-Efficient Adversarial Imitation Learning
Authors: Dahuin Jung, Hyungyu Lee, Sungroh Yoon
Description: Proposes sample-efficient adversarial imitation learning methods for RL optimization.
On the Sample Complexity and Metastability of Heavy-Tailed Policy Search in Continuous Control
Authors: Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel
Description: Analyzes sample complexity and metastability for heavy-tailed policy search in continuous control.
Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning
Authors: Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman
Description: Develops off-policy action anticipation methods for multi-agent RL optimization.
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
Authors: Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup
Description: Investigates policy gradient methods with symmetries and state abstractions for RL optimization.
Log Barriers for Safe Black-Box Optimization with Application to Safe Reinforcement Learning
Authors: Ilnura Usmanova, Yarden As, Maryam Kamgarpour, Andreas Krause
Description: Proposes log barriers for safe black-box optimization with applications to safe RL.
Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning
Authors: Jinchi Chen, Jie Feng, Weiguo Gao, Ke Wei
Description: Develops decentralized natural policy gradient with variance reduction for multi-agent RL.
Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity
Authors: Laixi Shi, Yuejie Chi
Description: Studies distributionally robust model-based offline RL with near-optimal sample complexity.
Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds
Authors: Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao
Description: Analyzes sample complexity of neural policy mirror descent for policy optimization on low-dimensional manifolds.
Mean-Field Approximation of Cooperative Constrained Multi-Agent Reinforcement Learning (CMARL)
Authors: Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri
Description: Proposes mean-field approximations for cooperative constrained multi-agent RL optimization.
Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning
Authors: Luofeng Liao, Zuyue Fu, Zhuoran Yang, Yixin Wang, Dingli Ma, Mladen Kolar, Zhaoran Wang
Description: Develops instrumental variable value iteration for causal offline RL optimization.
Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality
Authors: François G. Ged, Maria Han Veiga
Description: Introduces a Matryoshka policy gradient method for entropy-regularized RL with convergence guarantees.
Data-Efficient Policy Evaluation Through Behavior Policy Search
Authors: Josiah P. Hanna, Yash Chandak, Philip S. Thomas, Martha White, Peter Stone, Scott Niekum
Description: Proposes data-efficient policy evaluation methods for RL through behavior policy search.
Empirical Design in Reinforcement Learning
Authors: Andrew Patterson, Samuel Neumann, Martha White, Adam White
Description: Investigates empirical design strategies for optimization in reinforcement learning.
A New, Physics-Informed Continuous-Time Reinforcement Learning Algorithm with Performance Guarantees
Authors: Brent A. Wallace, Jennie Si
Description: Develops a physics-informed continuous-time RL algorithm with performance guarantees.

Ebooks & related papers on Convex Optimization

Mon, 15 Jul 2024 00:00:00 +0000

Ebooks #

Boris Mordukhovich, Nguyen Mau Nam. An Easy Path to Convex Analysis and Applications. 2023
Yurii Nesterov. Lectures on Convex Optimization. 2018
Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. 2015
Dimitri Bertsekas. Nonlinear Programming. 2016
Boris Teodorovich Polyak. Introduction to Optimization. 1987
R. T. Rockafellar. Convex Analysis. 1970
H. H. Bauschke & P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2011
Lieven Vandenberghe and Stephen P. Boyd. Convex Optimization. 2004

Papers #

Yu. E. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. 1983

Pre-print articles on Adagrad-variant methods

Mon, 15 Jul 2024 00:00:00 +0000

1. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models #

Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease on the average loss as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.

2. Accelerated Parameter-Free Stochastic Optimization #

Authors: Itai Kreisler, Maor Ivgi, Oliver Hinder, Yair Carmon

Abstract: We propose a method that achieves near-optimal rates for smooth stochastic convex optimization and requires essentially no prior knowledge of problem parameters. This improves on prior work which requires knowing at least the initial distance to optimality $d_0$. Our method, U-DoG, combines UniXGrad (Kavis et al., 2019) and DoG (Ivgi et al., 2023) with novel iterate stabilization techniques. It requires only loose bounds on $d_0$ and the noise magnitude, provides high probability guarantees under sub-Gaussian noise, and is also near-optimal in the non-smooth case. Our experiments show consistent, strong performance on convex problems and mixed results on neural network training.

3. Universal Gradient Methods for Stochastic Convex Optimization #

Authors: Anton Rodomanov, Ali Kavis, Yongtao Wu, Kimon Antonakopoulos, Volkan Cevher

Abstract: We develop universal gradient methods for Stochastic Convex Optimization (SCO). Our algorithms automatically adapt not only to the oracle’s noise but also to the Hölder smoothness of the objective function without a priori knowledge of the particular setting. The key ingredient is a novel strategy for adjusting step-size coefficients in the Stochastic Gradient Method (SGD). Unlike AdaGrad, which accumulates gradient norms, our Universal Gradient Method accumulates appropriate combinations of gradient- and iterate differences. The resulting algorithm has state-of-the-art worst-case convergence rate guarantees for the entire Hölder class including, in particular, both nonsmooth functions and those with Lipschitz continuous gradient. We also present the Universal Fast Gradient Method for SCO enjoying optimal efficiency estimates.
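For orientation, the AdaGrad baseline that the abstract contrasts against can be sketched in its scalar "norm" form. This is a minimal illustration of the general AdaGrad-Norm idea, not the Universal Gradient Method from the paper; the function name `adagrad_norm`, the defaults `eta` and `b0`, and the toy objective are assumptions made here for the example.

```python
import math

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-8, iters=500):
    """AdaGrad-Norm sketch: one scalar stepsize eta / b_k shared by all
    coordinates, where b_k^2 accumulates the squared norms of all gradients
    seen so far. All parameter names are illustrative."""
    x = list(x0)
    b_sq = b0 * b0  # small positive start avoids division by zero
    for _ in range(iters):
        g = grad_fn(x)
        b_sq += sum(gi * gi for gi in g)   # accumulate squared gradient norm
        step = eta / math.sqrt(b_sq)       # stepsize shrinks as gradients accumulate
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = adagrad_norm(lambda x: list(x), [3.0, 4.0])
```

The accumulator only ever grows, so the stepsize never increases; the abstract describes replacing this accumulation with combinations of gradient and iterate differences.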

Pre-print articles on Adaptive Optimization

Mon, 15 Jul 2024 00:00:00 +0000

1. A simple uniformly optimal method without line search for convex optimization #

Authors: Tianjiao Li, Guanghui Lan

Abstract: Line search (or backtracking) procedures have been widely employed into first-order methods for solving convex optimization problems, especially those with unknown problem parameters (e.g., Lipschitz constant). In this paper, we show that line search is superfluous in attaining the optimal rate of convergence for solving a convex optimization problem whose parameters are not given a priori. In particular, we present a novel accelerated gradient descent type algorithm called auto-conditioned fast gradient method (AC-FGM) that can achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex optimization without requiring the estimate of a global Lipschitz constant or the employment of line search procedures. We then extend AC-FGM to solve convex optimization problems with Hölder continuous gradients and show that it automatically achieves the optimal rates of convergence uniformly for all problem classes with the desired accuracy of the solution as the only input. Finally, we report some encouraging numerical results that demonstrate the advantages of AC-FGM over the previously developed parameter-free methods for convex optimization.

Source code: https://github.com/tli432/AC-FGM-Implementation

2. Adaptive Proximal Gradient Method for Convex Optimization #

Authors: Yura Malitsky, Konstantin Mishchenko

Abstract: In this paper, we explore two fundamental first-order algorithms in convex optimization, namely, gradient descent (GD) and proximal gradient method (ProxGD). Our focus is on making these algorithms entirely adaptive by leveraging local curvature information of smooth functions. We propose adaptive versions of GD and ProxGD that are based on observed gradient differences and, thus, have no added computational costs. Moreover, we prove convergence of our methods assuming only local Lipschitzness of the gradient. In addition, the proposed versions allow for even larger stepsizes than those initially suggested in [MM20].

Source code: https://github.com/ymalitsky/AdProxGD
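To make "based on observed gradient differences" concrete, here is a minimal sketch of the earlier adaptive gradient descent rule that the abstract cites as [MM20]. It is not the refined stepsize proposed in this paper and is not taken from the linked repository; the function names, the default `lam0`, and the toy quadratic are assumptions for illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def adaptive_gd(grad_fn, x0, lam0=1e-6, iters=200):
    """Adaptive GD sketch in the style of [MM20]: the stepsize is the smaller
    of a growth-limited previous stepsize and a local curvature estimate
    ||x_k - x_{k-1}|| / (2 ||grad_k - grad_{k-1}||)."""
    x_prev = list(x0)
    g_prev = grad_fn(x_prev)
    lam_prev = lam0
    theta = math.inf  # no growth limit on the first adaptive step
    x = [xi - lam0 * gi for xi, gi in zip(x_prev, g_prev)]
    for _ in range(iters):
        g = grad_fn(x)
        dx = norm([a - b for a, b in zip(x, x_prev)])
        dg = norm([a - b for a, b in zip(g, g_prev)])
        growth = math.sqrt(1.0 + theta) * lam_prev
        local = dx / (2.0 * dg) if dg > 0 else math.inf
        lam = min(growth, local)
        if math.isinf(lam):
            lam = lam_prev  # no curvature information yet: keep the old stepsize
        x_prev, g_prev = x, g
        x = [xi - lam * gi for xi, gi in zip(x, g)]
        theta = lam / lam_prev
        lam_prev = lam
    return x

# Toy ill-conditioned quadratic f(x) = 0.5 * (x_0^2 + 10 * x_1^2).
x_final = adaptive_gd(lambda x: [x[0], 10.0 * x[1]], [5.0, 5.0])
```

Each iteration reuses the gradient it already computed, so the rule adds no function or gradient evaluations beyond plain gradient descent.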

3. An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes #

Authors: Antonio Orvieto, Lin Xiao

Abstract: We consider the problem of minimizing the average of a large number of smooth but possibly non-convex functions. In the context of most machine learning applications, each loss function is non-negative and thus can be expressed as the composition of a square and its real-valued square root. This reformulation allows us to apply the Gauss-Newton method, or the Levenberg-Marquardt method when adding a quadratic regularization. The resulting algorithm, while being computationally as efficient as the vanilla stochastic gradient method, is highly adaptive and can automatically warmup and decay the effective stepsize while tracking the non-negative loss landscape. We provide a tight convergence analysis, leveraging new techniques, in the stochastic convex and non-convex settings. In particular, in the convex case, the method does not require access to the gradient Lipschitz constant for convergence, and is guaranteed to never diverge. The convergence rates and empirical evaluations compare favorably to the classical (stochastic) gradient method as well as to several other adaptive methods.
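The reformulation in the abstract can be worked through by hand. Writing $f = r^2$ with $r = \sqrt{f}$, linearizing $r$ around the current point, and adding a regularizer $\|p\|^2 / (2\sigma)$ gives the update $p = -\gamma \nabla f$ with $\gamma = \sigma / (1 + \sigma \|\nabla f\|^2 / (2 f))$. The sketch below implements that stepsize in a deterministic, full-batch setting; the regularization scaling, the function names, and the toy loss are assumptions of this example rather than details confirmed by the abstract.

```python
def ngn_stepsize(loss, grad, sigma):
    """Stepsize from a regularized Gauss-Newton step on sqrt(loss).
    Assumes loss >= 0; sigma acts as the largest stepsize allowed."""
    grad_sq = sum(g * g for g in grad)
    if grad_sq == 0.0:
        return sigma   # stationary point: the stepsize is irrelevant
    if loss <= 0.0:
        return 0.0     # limit of the formula as loss -> 0 with a nonzero gradient
    return sigma / (1.0 + sigma * grad_sq / (2.0 * loss))

def ngn_descent(loss_fn, grad_fn, x0, sigma, iters):
    """Gradient descent with the stepsize above (full-batch illustration)."""
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        gamma = ngn_stepsize(loss_fn(x), g, sigma)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return x

# Toy non-negative loss f(x) = 0.5 * ((x_0 - 3)^2 + (x_1 + 1)^2).
loss_fn = lambda x: 0.5 * ((x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2)
grad_fn = lambda x: [x[0] - 3.0, x[1] + 1.0]
x_final = ngn_descent(loss_fn, grad_fn, [10.0, 10.0], sigma=100.0, iters=50)
```

In this formula the stepsize is always below both $\sigma$ and $2f / \|\nabla f\|^2$, so even a very large $\sigma$ cannot overshoot a zero of the linearized square root.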

4. Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance #

Authors: Dimitris Oikonomou, Nicolas Loizou

Abstract: Stochastic gradient descent with momentum, also known as Stochastic Heavy Ball method (SHB), is one of the most popular algorithms for solving large-scale stochastic optimization problems in various machine learning tasks. In practical scenarios, tuning the step-size and momentum parameters of the method is a prohibitively expensive and time-consuming process. In this work, inspired by the recent advantages of stochastic Polyak step-size in the performance of stochastic gradient descent (SGD), we propose and explore new Polyak-type variants suitable for the update rule of the SHB method. In particular, using the Iterate Moving Average (IMA) viewpoint of SHB, we propose and analyze three novel step-size selections: $\text{MomSPS} _{\max}$, $\text{MomDecSPS}$, and $\text{MomAdaSPS}$. For $\text{MomSPS} _{\max}$, we provide convergence guarantees for SHB to a neighborhood of the solution for convex and smooth problems (without assuming interpolation). If interpolation is also satisfied, then using $\text{MomSPS} _{\max}$, SHB converges to the true solution at a fast rate matching the deterministic HB. The other two variants, MomDecSPS and MomAdaSPS, are the first adaptive step-size for SHB that guarantee convergence to the exact minimizer - without a priori knowledge of the problem parameters and without assuming interpolation. Our convergence analysis of SHB is tight and obtains the convergence guarantees of stochastic Polyak step-size for SGD as a special case. We supplement our analysis with experiments validating our theory and demonstrating the effectiveness and robustness of our algorithms.

Where: 13th International Conference on Learning Representations (ICLR 2025)

OpenReview: https://openreview.net/forum?id=nuX2yPejiL

Pre-print articles on gradient-clipping methods

Mon, 15 Jul 2024 00:00:00 +0000

1. Why gradient clipping accelerates training: A theoretical justification for adaptivity #

Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.

2. Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees #

Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich

Abstract: Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c > 0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions.

In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrarily small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.
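The clipping operation described at the start of the abstract is short enough to write out. The following is a minimal deterministic sketch, assuming Euclidean-norm clipping with threshold `c`; the function names and the toy objective are illustrative and do not come from the paper.

```python
import math

def clip_gradient(grad, c):
    """Rescale grad so that its Euclidean norm is at most c (c > 0)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= c:
        return list(grad)          # small gradients pass through unchanged
    scale = c / norm
    return [scale * g for g in grad]

def clipped_gd(grad_fn, x0, step, c, iters):
    """Gradient descent in which every gradient is clipped to norm c."""
    x = list(x0)
    for _ in range(iters):
        g = clip_gradient(grad_fn(x), c)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = clipped_gd(lambda x: list(x), [10.0, -10.0], step=0.1, c=1.0, iters=500)
```

On this toy problem, clipping only limits how far each early step moves; once the gradient norm falls below `c`, the iteration is ordinary gradient descent.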

3. Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed #

Authors: Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov

Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam-Norm in handling the heavy-tailed noise.

Mathematics - Optimization

Thu, 27 Jun 2024 23:14:15 +0800

Branches of Optimization Research #

Convex Optimization #

Convex optimization focuses on problems where the objective function and constraints are convex, so every local minimum is also a global minimum (and the minimizer is unique when the objective is strictly convex). This field is foundational in machine learning, signal processing, and control systems because many such problems admit efficient algorithms with convergence guarantees.

Convex Optimization by Boyd and Vandenberghe - PDF
Convex Optimization Theory by Dimitri P. Bertsekas - PDF

Discrete, Combinatorial, and Integer Optimization #

This branch deals with optimization problems involving discrete variables, such as integers or combinatorial structures, often encountered in scheduling, network design, and logistics. The titles listed here cover Bayesian optimization, a related technique that is particularly useful for optimizing expensive black-box functions.

Bayesian Optimization In Action by Quan Nguyen - Amazon
Experimentation for Engineers by David Sweet - Amazon

Operations Research #

Operations research applies mathematical modeling and optimization to complex decision-making in logistics, supply chain, and resource allocation. It integrates techniques like linear programming, simulation, and heuristic methods to optimize real-world systems.

Operations Research An Introduction by Hamdy A. Taha - Pearson
Introduction to Operations Research by Frederick Hillier and Gerald Lieberman - McGraw Hill
Julia Programming for Operations Research by Changhyun Kwon - PDF - code
Mathematical Programming and Operations Research: Modeling, Algorithms, and Complexity. Examples in Python and Julia. Edited by Robert Hildebrand - PDF
A First Course in Linear Optimization by Jon Lee - PDF
Decomposition Techniques in Mathematical Programming by Conejo, Castillo, Mínguez, and García-Bertrand - Springer
Algorithms for Optimization by Mykel J. Kochenderfer and Tim A. Wheeler - PDF
Model Building in Mathematical Programming - Introductory modeling book by H. Paul Williams - Wiley

Meta-heuristics #

Meta-heuristics are high-level strategies for solving complex optimization problems where exact methods are computationally infeasible. They include nature-inspired algorithms like genetic algorithms and simulated annealing, widely used in engineering and data science.

Metaheuristics by Patrick Siarry - Springer (open access)
Essentials of Metaheuristics by Sean Luke - link
Handbook of Metaheuristics by Michel Gendreau and Jean-Yves Potvin - Springer (open access)
An Introduction to Metaheuristics for Optimization by Bastien Chopard and Marco Tomassini - Springer (open access)
Metaheuristic and Evolutionary Computation: Algorithms and Applications by Hasmat Malik, Atif Iqbal, Puneet Joshi, Sanjay Agrawal, and Farhad Ilahi Bakhsh - Springer (open access)
Clever Algorithms: Nature-Inspired Programming Recipes by Jason Brownlee - GitHub
Metaheuristics: from design to implementation by El-Ghazali Talbi - Wiley
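As a concrete instance of the simulated annealing mentioned above, here is a minimal sketch. The cooling schedule, the defaults, and the toy integer problem are assumptions chosen for the example and are not taken from any of the listed books.

```python
import math
import random

def simulated_annealing(cost, neighbor, start, t0=10.0, cooling=0.995,
                        iters=5000, seed=0):
    """Simulated annealing sketch with geometric cooling.
    Returns the best state seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(iters):
        candidate = neighbor(current, rng)
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost

# Toy problem: minimize (x - 7)^2 over the integers in [-100, 100].
cost = lambda x: (x - 7) ** 2
neighbor = lambda x, rng: max(-100, min(100, x + rng.choice((-1, 1))))
best_x, best_cost = simulated_annealing(cost, neighbor, start=80)
```

The toy cost has a single minimum, so it only demonstrates the mechanics; the occasional acceptance of worse moves matters on landscapes with many local minima.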

Dynamic Programming and Reinforcement Learning #

Dynamic programming and reinforcement learning address sequential decision-making problems, breaking them into subproblems or learning optimal policies through interaction with environments. These methods are critical in robotics, finance, and AI.

Various titles on Dynamic Programming, Optimal Control and Reinforcement Learning by Dimitri Bertsekas - List
Reinforcement Learning: An Introduction (2nd Edition) by Richard Sutton and Andrew Barto - PDF
Decision Making Under Uncertainty: Theory and Application by Mykel J. Kochenderfer - PDF
Algorithms for Decision Making by Mykel J. Kochenderfer, Tim A. Wheeler, and Kyle H. Wray - PDF
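
To make the "breaking problems into subproblems" idea from the section introduction concrete, the following sketch runs value iteration on a tiny deterministic chain. The chain environment, the unit step cost, and the function name `chain_value_iteration` are hypothetical choices for illustration only.

```python
def chain_value_iteration(n_states, goal, step_cost=1.0, gamma=1.0, tol=1e-9):
    """Value iteration on a line of states; actions move one step left or right."""
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            if s == goal:
                continue                      # terminal state keeps value 0
            candidates = []
            for move in (-1, 1):
                nxt = min(max(s + move, 0), n_states - 1)   # walls at both ends
                # Bellman backup: immediate reward plus discounted next value.
                candidates.append(-step_cost + gamma * values[nxt])
            best = max(candidates)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


values = chain_value_iteration(n_states=4, goal=3)
print(values)
```

Each state's value ends up as the negative of its shortest distance to the goal, which is exactly the subproblem structure that dynamic programming exploits.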

Constraint Programming #

Constraint programming solves problems by defining constraints that must be satisfied, often used in scheduling, planning, and configuration tasks. It excels in problems with complex logical constraints and discrete variables.

Handbook of Constraint Programming by Francesca Rossi, Peter van Beek and Toby Walsh - Amazon
A Tutorial on Constraint Programming by Barbara M. Smith (University of Leeds) - PDF
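
The constraint-satisfaction view described in the section introduction can be sketched with a small backtracking search. The N-queens puzzle and the function name `solve_n_queens` are assumptions picked for illustration; production constraint solvers add propagation and smarter variable ordering on top of this basic scheme.

```python
def solve_n_queens(n):
    """Return one placement as a list cols[row] = column, or None if none exists."""
    cols = []

    def consistent(col):
        # The new queen goes in row len(cols); check column and diagonal clashes.
        row = len(cols)
        for r, c in enumerate(cols):
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def extend():
        if len(cols) == n:
            return True                # every row has a queen
        for col in range(n):
            if consistent(col):
                cols.append(col)       # tentatively assign
                if extend():
                    return True
                cols.pop()             # undo and try the next value
        return False

    return list(cols) if extend() else None


print(solve_n_queens(4))
print(solve_n_queens(3))
```

The search assigns one discrete variable at a time and backtracks as soon as a constraint is violated, so infeasible branches are cut without enumerating them fully.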

Combinatorial Optimization #

Combinatorial optimization focuses on finding optimal solutions in discrete structures, such as graphs or sets, often using algorithms for problems like the traveling salesman or graph coloring, with applications in logistics and network design.

Combinatorial Optimization: Algorithms and Complexity by Christos H. Papadimitriou and Kenneth Steiglitz - Amazon
Combinatorial Optimization: Theory and Algorithms by Bernhard Korte and Jens Vygen - Springer
A First Course in Combinatorial Optimization by Jon Lee - Amazon
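
For the traveling salesman problem named in the section introduction, a brute-force baseline shows what "optimal solution in a discrete structure" means in practice. This is a sketch for tiny instances only; the point set `square` and the function names are hypothetical, and the enumeration grows factorially with the number of cities.

```python
import itertools
import math


def tour_length(points, order):
    # Sum of Euclidean edge lengths along the closed tour.
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tsp(points):
    """Enumerate every tour that starts at city 0 and keep the shortest."""
    n = len(points)
    best_order, best_len = None, math.inf
    for perm in itertools.permutations(range(1, n)):
        order = (0,) + perm
        length = tour_length(points, order)
        if length < best_len:
            best_order, best_len = order, length
    return best_order, best_len


square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
order, length = brute_force_tsp(square)
print(order, length)
```

Fixing city 0 as the start removes rotations of the same tour from the enumeration; the remaining factorial growth is why the books above focus on bounds, relaxations, and approximation algorithms.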

Stochastic Optimization and Control #

Stochastic optimization handles problems with uncertainty or randomness, using probabilistic models to optimize objectives. It is widely applied in machine learning, finance, and operations research for robust decision-making.

Lectures on Stochastic Programming: Modeling and Theory (SIAM) by Shapiro, Dentcheva, and Ruszczynski - PDF
Introductory Lectures on Stochastic Optimization by John C. Duchi - PDF
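
A small experiment can illustrate optimizing an objective defined through randomness, as described in the section introduction. The sketch below assumes the toy objective $\mathbb{E}[(x - \xi)^2]/2$ with Gaussian $\xi$ (an assumption for illustration) and compares stochastic gradient descent with the sample average approximation, whose minimizer here is simply the sample mean.

```python
import random


def sgd_mean(samples, x0=0.0):
    """SGD on f(x) = E[(x - xi)^2] / 2 with step size 1/k."""
    x = x0
    for k, xi in enumerate(samples, start=1):
        grad = x - xi            # stochastic gradient of (x - xi)^2 / 2
        x -= grad / k            # diminishing step size 1/k
    return x


rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(2000)]
x_sgd = sgd_mean(samples)
x_saa = sum(samples) / len(samples)   # minimizer of the sample average objective
print(x_sgd, x_saa)
```

With the step size $1/k$ the SGD iterate is exactly the running mean of the samples seen so far, so on this particular problem the two approaches agree up to floating-point rounding.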

Useful Resources #

Prof. Nguyen Mau Nam, Convex Analysis - An introduction to convexity and nonsmooth analysis
Ben Recht, arg min
Prof. Dimitri P. Bertsekas, Convex Analysis and Optimization
Prof. Dimitri P. Bertsekas, Nonlinear Programming: 3rd Edition
Off the convex path

Post on Optimization #

Pre-print articles on Difference-of-Convex (DC) Programming

Thu, 27 Jun 2024 23:14:15 +0800

57. Stochastic Difference-of-Convex Optimization with Momentum #

Authors: El Mahdi Chayti, Martin Jaggi

Abstract: Stochastic difference-of-convex (DC) optimization is prevalent in numerous machine learning applications, yet its convergence properties under small batch sizes remain poorly understood. Existing methods typically require large batches or strong noise assumptions, which limit their practical use. In this work, we show that momentum enables convergence under standard smoothness and bounded variance assumptions (of the concave part) for any batch size. We prove that without momentum, convergence may fail regardless of stepsize, highlighting its necessity. Our momentum-based algorithm achieves provable convergence and demonstrates strong empirical performance.

URL: https://arxiv.org/abs/2510.17503

56. On the convergence rate of the boosted Difference-of-Convex Algorithm (DCA) #

Authors: Hadi Abbaszadehpeivasti, Etienne de Klerk, Adrien Taylor

Abstract: The difference-of-convex algorithm (DCA) is a well-established nonlinear programming technique that solves successive convex optimization problems. These sub-problems are obtained from the difference-of-convex (DC) decompositions of the objective and constraint functions. We investigate the worst-case performance of the unconstrained DCA, with and without boosting, where boosting simply performs an additional step in the direction generated by the usual DCA method. We show that, for certain classes of DC decompositions, the boosted DCA is provably better in the worst-case than the usual DCA. While several numerical studies have reported that boosted DCA outperforms classical DCA, a theoretical explanation for this behavior has, to the best of our knowledge, not been given until now. Our proof technique relies on semidefinite programming (SDP) performance estimation.

URL: https://arxiv.org/abs/2510.16569
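
The DCA iteration and the extra "boosting" step described in this abstract can be sketched on a one-dimensional toy problem. Everything below is an assumption for illustration: the objective $f(x) = x^4 - 2x^2$ with decomposition $g(x) = x^4$, $h(x) = 2x^2$, the Armijo-type backtracking rule, and the parameter values are not taken from the paper, whose analysis concerns worst-case rates rather than this particular line search.

```python
import math


def f(x):
    # Toy objective f = g - h with g(x) = x**4 and h(x) = 2 * x**2 (both convex).
    return x ** 4 - 2.0 * x ** 2


def dca_step(x):
    # Minimize g(y) - h'(x) * y: 4 * y**3 = 4 * x, so y is the real cube root of x.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def dca(x0, iters=50, boosted=False, alpha=0.1, beta=0.5, lam0=1.0, tol=1e-12):
    x = x0
    for _ in range(iters):
        y = dca_step(x)
        d = y - x                      # direction generated by the DCA step
        if abs(d) < tol:
            return y
        if boosted:
            # Backtracking: try to move further along d, starting from y.
            lam = lam0
            while lam > 1e-10 and f(y + lam * d) > f(y) - alpha * lam ** 2 * d ** 2:
                lam *= beta
            x = y + lam * d if lam > 1e-10 else y   # fall back to the DCA point
        else:
            x = y
    return x


print(dca(0.1), dca(0.1, boosted=True))
```

Both variants approach the critical point $x = 1$ from the starting point $0.1$; the boosted variant accepts the extra step only when it yields sufficient decrease relative to the plain DCA point, so it is never worse in objective value at a given iteration.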

55. Global solution algorithms for DC programming via polyhedral approximations of convex functions #

Authors: Fahaar M. Pirani & Firdevs Ulus

Abstract: We consider difference of convex (DC) programming problems and propose three algorithms to solve them globally. The main working mechanism of the proposed algorithms is to generate polyhedral underestimators to convex functions. Two of these algorithms generate a ‘fine’ polyhedral approximation of the first convex component over the compact feasible region of the DC programming problem. We prove the finiteness of these algorithms and establish the convergence rate of one of them. Moreover, we show that using the polyhedral approximation of the first component, it is possible to compute an approximate global solution of the corresponding DC programming problem without further computational effort. The third algorithm also computes a polyhedral underestimator of the first component of the DC function. Different from the first two algorithms, the third algorithm approximates it locally until finding an approximate global solution to the DC programming problem. It is shown that for any positive approximation error, the third algorithm stops after finitely many iterations. Computational results based on some test instances from the literature are provided.

URL: https://link.springer.com/article/10.1007/s10898-025-01535-z

54. Improved Rates for Stochastic Variance-Reduced Difference-of-Convex Algorithms #

Authors: Anh Duc Nguyen, Alp Yurtsever, Suvrit Sra, Kim-Chuan Toh

Abstract: In this work, we propose and analyze DCA-PAGE, a novel algorithm that integrates the difference-of-convex algorithm (DCA) with the ProbAbilistic Gradient Estimator (PAGE) to solve structured nonsmooth difference-of-convex programs. In the finite-sum setting, our method achieves a gradient computation complexity of $O(N + N^{1/2}\varepsilon^{-2})$ with sample size $N$, surpassing the previous best-known complexity of $O(N + N^{2/3}\varepsilon^{-2})$ for stochastic variance-reduced (SVR) DCA methods. Furthermore, DCA-PAGE readily extends to online settings with a similar optimal gradient computation complexity $O(b + b^{1/2}\varepsilon^{-2})$ with batch size $b$, a significant advantage over existing SVR DCA approaches that only work for the finite-sum setting. We further refine our analysis with a gap function, which enables us to obtain comparable convergence guarantees under milder assumptions.

Comment: Accepted at IEEE Conference on Decision and Control (IEEE CDC 2025)

URL: https://arxiv.org/pdf/2509.11657

53. New Algorithms for maximizing the difference of convex functions #

Authors: Aharon Ben-Tal, Luba Tetruashvili

Abstract: Maximizing the difference of two convex functions over a convex feasible set (the so-called DCA problem) is a hard problem. There is a large number of publications addressing this problem. Many of them are variations of the widely used DCA algorithm [20]. The success of this algorithm in reaching a good approximation of a global optimum depends crucially on the choice of its starting point. In the algorithm developed in our paper, MDCF (Maximizing the Difference of Convex Functions), a major effort is to generate a good starting point. This is obtained by using the COMAX algorithm for maximizing a convex function [6]. The solution found by COMAX is a basis for obtaining a good starting point for MDCF. Another contribution of the paper is the algorithm for solving problems with an indefinite quadratic objective function and compact and convex feasible set. The problem is first converted to maximizing a difference of convex quadratic functions. The new algorithm QMDCF is a specific adaptation of MDCF to this case. The performance of the two new algorithms developed in the paper is tested numerically, and results are compared to the performance of classical DCA, and some other algorithms.

URL: https://optimization-online.org/2025/04/new-algorithms-for-maximizing-the-difference-of-convex-functions/

52. A progressive decoupling algorithm for minimizing the difference of convex and weakly convex functions #

Authors: Welington de Oliveira & João Carlos de Oliveira Souza

Abstract: Commonly, decomposition and splitting techniques for optimization problems strongly depend on convexity. Implementable splitting methods for nonconvex and nonsmooth optimization problems are scarce and often lack convergence guarantees. Among the few exceptions is the Progressive Decoupling Algorithm (PDA), which has local convergence should convexity be elicitable. In this work, we furnish PDA with a descent test and extend the method to accommodate a broad class of nonsmooth optimization problems with non-elicitable convexity. More precisely, we focus on the problem of minimizing the difference of convex and weakly convex functions over a linear subspace. This framework covers, in particular, a family of stochastic programs with nonconvex recourse and statistical estimation problems for supervised learning.

URL: https://link.springer.com/article/10.1007/s10957-024-02574-4

51. An Inexact Proximal Framework for Nonsmooth Riemannian Difference-of-Convex Optimization [arXiv:2509.08561] #

Authors: Bo Jiang, Meng Xu, Xingju Cai, Ya-Feng Liu

Abstract: Nonsmooth Riemannian optimization has attracted increasing attention, especially in problems with sparse structures. While existing formulations typically involve convex nonsmooth terms, incorporating nonsmooth difference-of-convex (DC) penalties can enhance recovery accuracy. In this paper, we study a class of nonsmooth Riemannian optimization problems whose objective is the sum of a smooth function and a nonsmooth DC term. We establish, for the first time in the manifold setting, the equivalence between such DC formulations (with suitably chosen nonsmooth DC terms) and their $\ell_0$-regularized or $\ell_0$-constrained counterparts. To solve these problems, we propose an inexact Riemannian proximal DC (iRPDC) algorithmic framework, which returns an $\epsilon$-Riemannian critical point within $\mathcal{O}(\epsilon^{-2})$ outer iterations. Within this framework, we develop several practical algorithms based on different subproblem solvers. Among them, one achieves an overall iteration complexity of $\mathcal{O}(\epsilon^{-3})$, which matches the best-known bound in the literature. In contrast, existing algorithms either lack provable overall complexity or require $\mathcal{O}(\epsilon^{-3})$ iterations in both outer and overall complexity. A notable feature of the iRPDC algorithmic framework is a novel inexactness criterion that not only enables efficient subproblem solutions via first-order methods but also facilitates a linesearch procedure that adaptively captures the local curvature. Numerical results on sparse principal component analysis demonstrate the modeling flexibility of the DC formulation and the competitive performance of the proposed algorithmic framework.

URL: https://arxiv.org/abs/2509.08561

50. Tight Convergence Rates in Gradient Mapping for the Difference-of-Convex Algorithm [arXiv:2506.01791] #

Authors: Teodor Rotaru, Panagiotis Patrinos, François Glineur

Abstract: We establish new theoretical convergence guarantees for the difference-of-convex algorithm (DCA), where the second function is allowed to be weakly-convex, measuring progress via composite gradient mapping. Based on a tight analysis of two iterations of DCA, we identify six parameter regimes leading to sublinear convergence rates toward critical points and establish those rates by proving adapted descent lemmas. We recover existing rates for the standard difference-of-convex decompositions of nonconvex-nonconcave functions, while for all other curvature settings our results are new, complementing recently obtained rates on the gradient residual. Three of our sublinear rates are tight for any number of DCA iterations, while for the other three regimes we conjecture exact rates, using insights from the tight analysis of gradient descent and numerical validation using the performance estimation methodology. Finally, we show how the equivalence between proximal gradient descent (PGD) and DCA allows the derivation of exact PGD rates for any constant stepsize.

URL: https://arxiv.org/abs/2506.01791
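
The equivalence between proximal gradient descent and DCA mentioned at the end of this abstract can be seen from a short derivation. The notation below is an assumption for illustration and is not taken from the paper: the objective is $\varphi + \psi$ with $\varphi$ being $L$-smooth and $\psi$ proper, closed, and convex, and $\gamma > 0$ is the PGD stepsize.

```latex
\begin{align*}
f_1(x) &= \psi(x) + \tfrac{1}{2\gamma}\|x\|^2, \qquad
f_2(x) = \tfrac{1}{2\gamma}\|x\|^2 - \varphi(x), \qquad
\varphi + \psi = f_1 - f_2, \\
x_{k+1} &= \arg\min_x \; f_1(x) - \langle \nabla f_2(x_k),\, x \rangle \\
&= \arg\min_x \; \psi(x) + \tfrac{1}{2\gamma}\|x\|^2
   - \big\langle \tfrac{1}{\gamma}x_k - \nabla\varphi(x_k),\, x \big\rangle \\
&= \arg\min_x \; \psi(x) + \tfrac{1}{2\gamma}\big\|x - \big(x_k - \gamma\nabla\varphi(x_k)\big)\big\|^2 \\
&= \operatorname{prox}_{\gamma\psi}\big(x_k - \gamma\nabla\varphi(x_k)\big).
\end{align*}
```

Under these assumptions $f_2$ is guaranteed to be convex when $\gamma \le 1/L$ and is in general only weakly convex for larger stepsizes, which is consistent with the abstract allowing a weakly convex second function.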

49. Enforcing Fairness Where It Matters: An Approach Based on Difference-of-Convex Constraints [arXiv:2505.12530] #

Authors: Yutian He, Yankun Huang, Yao Yao, Qihang Lin

Abstract: Fairness in machine learning has become a critical concern, particularly in high-stakes applications. Existing approaches often focus on achieving full fairness across all score ranges generated by predictive models, ensuring fairness in both high and low-scoring populations. However, this stringent requirement can compromise predictive performance and may not align with the practical fairness concerns of stakeholders. In this work, we propose a novel framework for building partially fair machine learning models, which enforce fairness within a specific score range of interest, such as the middle range where decisions are most contested, while maintaining flexibility in other regions. We introduce two statistical metrics to rigorously evaluate partial fairness within a given score range, such as the top 20%-40% of scores. To achieve partial fairness, we propose an in-processing method by formulating the model training problem as constrained optimization with difference-of-convex constraints, which can be solved by an inexact difference-of-convex algorithm (IDCA). We provide the complexity analysis of IDCA for finding a nearly KKT point. Through numerical experiments on real-world datasets, we demonstrate that our framework achieves high predictive performance while enforcing partial fairness where it matters most.

URL: https://arxiv.org/abs/2505.12530

48. A smoothing moving balls approximation method for a class of conic-constrained difference-of-convex optimization problems [arXiv:2505.12314] #

Authors: Jiefeng Xu, Ting Kei Pong, Nung-sing Sze

Abstract: In this paper, we consider the problem of minimizing a difference-of-convex objective over a nonlinear conic constraint, where the cone is closed, convex, pointed and has a nonempty interior. We assume that the support function of a compact base of the polar cone exhibits a majorizing smoothing approximation, a condition that is satisfied by widely studied cones such as $\mathbb{R}^m_-$ and ${\cal S}^m_-$. Leveraging this condition, we reformulate the conic constraint equivalently as a single constraint involving the aforementioned support function, and adapt the moving balls approximation (MBA) method for its solution. In essence, in each iteration of our algorithm, we approximate the support function by a smooth approximation function and apply one MBA step. The subproblems that arise in our algorithm always involve only one single inequality constraint, and can thus be solved efficiently via one-dimensional root-finding procedures. We design explicit rules to evolve the smooth approximation functions from iteration to iteration and establish the corresponding iteration complexity for obtaining an $ε$-Karush-Kuhn-Tucker point. In addition, in the convex setting, we establish convergence of the sequence generated, and study its local convergence rate under a standard Hölderian growth condition. Finally, we illustrate numerically the effects of different rules of evolving the smooth approximation functions on the rate of convergence.

URL: https://arxiv.org/abs/2505.12314

47. A preconditioned difference of convex functions algorithm with extrapolation and line search [arXiv:2505.11914] #

Authors: Ran Zhang, Hongpeng Sun

Abstract: This paper proposes a novel proximal difference-of-convex (DC) algorithm enhanced with extrapolation and aggressive non-monotone line search for solving non-convex optimization problems. We introduce an adaptive conservative update strategy of the extrapolation parameter determined by a computationally efficient non-monotone line search. The core of our algorithm is to unite the update of the extrapolation parameter with the step size of the non-monotone line search interactively. The global convergence of the two proposed algorithms is established through the Kurdyka-Łojasiewicz properties, ensuring convergence within a preconditioned framework for linear equations. Numerical experiments on two general non-convex problems: SCAD-penalized binary classification and graph-based Ginzburg-Landau image segmentation models, demonstrate the proposed method’s high efficiency compared to existing DC algorithms both in convergence rate and solution accuracy.

URL: https://arxiv.org/abs/2505.11914

46. Contractive difference-of-convex algorithms [arXiv:2505.10800] #

Authors: Songnian He, Qiao-Li Dong, Michael Th. Rassias

Abstract: The difference-of-convex algorithm (DCA) and its variants are the most popular methods to solve the difference-of-convex optimization problem. Each iteration of them is reduced to a convex optimization problem, which generally needs to be solved by iterative methods such as proximal gradient algorithm. However, these algorithms essentially belong to some iterative methods of fixed point problems of averaged mappings, and their convergence speed is generally slow. Furthermore, there is seldom research on the termination rule of these iterative algorithms solving the subproblem of DCA. To overcome these defects, we firstly show that the subproblem of the linearized proximal method (LPM) in each iteration is equal to the fixed point problem of a contraction. Secondly, by using Picard iteration to approximately solve the subproblem of LPM in each iteration, we propose a contractive difference-of-convex algorithm (cDCA) where an adaptive termination rule is presented. Both global subsequential convergence and global convergence of the whole sequence of cDCA are established. Finally, preliminary results from numerical experiments are promising.

URL: https://link.springer.com/article/10.1007/s10957-025-02689-2

Journal: Journal of Optimization Theory and Applications

45. A full splitting algorithm for structured difference-of-convex programs [arXiv:2505.02588] #

Authors: Radu Ioan Bot, Rossen Nenov, Min Tao

Abstract: In this paper, we study a class of nonconvex and nonsmooth structured difference-of-convex (DC) programs, which contain in the convex part the sum of a nonsmooth linearly composed convex function and a differentiable function, and in the concave part another nonsmooth linearly composed convex function. Among the various areas in which such problems occur, we would like to mention in particular the recovery of sparse signals. We propose an adaptive double-proximal, full-splitting algorithm with a moving center approach in the final subproblem, which addresses the challenge of evaluating compositions by decoupling the linear operator from the nonsmooth component. We establish the subsequential convergence of the generated sequence of iterates to an approximate stationary point and prove its global convergence under the Kurdyka-Łojasiewicz property. We also discuss the tightness of the convergence results and provide insights into the rationale for seeking an approximate KKT point. This is illustrated by constructing a counterexample showing that the algorithm can diverge when seeking exact solutions. Finally, we present a practical version of the algorithm that incorporates a nonmonotone line search, which significantly improves the convergence performance.

URL: https://arxiv.org/abs/2505.02588

44. Optimization over Trained Neural Networks: Difference-of-Convex Algorithm and Application to Data Center Scheduling [arXiv:2503.17506] #

Authors: Xinwei Liu, Vladimir Dvorkin

Abstract: When solving decision-making problems with mathematical optimization, some constraints or objectives may lack analytic expressions but can be approximated from data. When an approximation is made by neural networks, the underlying problem becomes optimization over trained neural networks. Despite recent improvements with cutting planes, relaxations, and heuristics, the problem remains difficult to solve in practice. We propose a new solution based on a bilinear problem reformulation that penalizes ReLU constraints in the objective function. This reformulation makes the problem amenable to efficient difference-of-convex algorithms (DCA), for which we propose a principled approach to penalty selection that facilitates convergence to stationary points of the original problem. We apply the DCA to the problem of the least-cost allocation of data center electricity demand in a power grid, reporting significant savings in congested cases.

URL: https://arxiv.org/abs/2503.17506

43. Tight Analysis of Difference-of-Convex Algorithm (DCA) Improves Convergence Rates for Proximal Gradient Descent [arXiv:2503.04486] #

Authors: Teodor Rotaru, Panagiotis Patrinos, François Glineur

Abstract: We investigate a difference-of-convex (DC) formulation where the second term is allowed to be weakly convex. We examine the precise behavior of a single iteration of the difference-of-convex algorithm (DCA), providing a tight characterization of the objective function decrease, distinguishing between six distinct parameter regimes. Our proofs, inspired by the performance estimation framework, are notably simplified compared to related prior research. We subsequently derive sublinear convergence rates for the DCA towards critical points, assuming at least one of the functions is smooth. Additionally, we explore the underexamined equivalence between proximal gradient descent (PGD) and DCA iterations, demonstrating how DCA, a parameter-free algorithm, without the need for a stepsize, serves as a tool for studying the exact convergence rates of PGD.

URL: https://arxiv.org/abs/2503.04486

42. Abstract nonautonomous difference inclusions in locally convex spaces [arXiv:2502.05184] #

Authors: Marko Kostic

Abstract: In this paper, we consider abstract nonautonomous difference inclusions in locally convex spaces with integer order differences. We particularly analyze the existence and uniqueness of almost periodic type solutions to abstract nonautonomous difference inclusions. Our results seem to be completely new even in the Banach space setting.

URL: https://arxiv.org/abs/2502.05184

41. Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees [arXiv:2502.00240] #

Authors: Yasi Zhang, Oscar Leong

Abstract: Learning effective regularization is crucial for solving ill-posed inverse problems, which arise in a wide range of scientific and engineering applications. While data-driven methods that parameterize regularizers using deep neural networks have demonstrated strong empirical performance, they often result in highly nonconvex formulations that lack theoretical guarantees. Recent work has shown that incorporating structured nonconvexity into neural network-based regularizers, such as weak convexity, can strike a balance between empirical performance and theoretical tractability. In this paper, we demonstrate that a broader class of nonconvex functions, difference-of-convex (DC) functions, can yield improved empirical performance while retaining strong convergence guarantees. The DC structure enables the use of well-established optimization algorithms, such as the Difference-of-Convex Algorithm (DCA) and a Proximal Subgradient Method (PSM), which extend beyond standard gradient descent. Furthermore, we provide theoretical insights into the conditions under which optimal regularizers can be expressed as DC functions. Extensive experiments on computed tomography (CT) reconstruction tasks show that our approach achieves strong performance across sparse and limited-view settings, consistently outperforming other weakly supervised learned regularizers. Our code is available at https://github.com/YasminZhang/ADCR.

URL: https://arxiv.org/abs/2502.00240

40. An Inexact Boosted Difference of Convex Algorithm for Nondifferentiable Functions [arXiv:2412.05697] #

Authors: Orizon P. Ferreira, Boris S. Mordukhovich, Wilkreffy M. S. Santos, João Carlos O. Souza

Abstract: In this paper, we introduce an inexact approach to the Boosted Difference of Convex Functions Algorithm (BDCA) for solving nonconvex and nondifferentiable problems involving the difference of two convex functions (DC functions). Specifically, when the first DC component is differentiable and the second may be nondifferentiable, BDCA utilizes the solution from the subproblem of the DC Algorithm (DCA) to define a descent direction for the objective function. A monotone linesearch is then performed to find a new point that improves the objective function relative to the subproblem solution. This approach enhances the performance of DCA. However, if the first DC component is nondifferentiable, the BDCA direction may become an ascent direction, rendering the monotone linesearch ineffective. To address this, we propose an Inexact nonmonotone Boosted Difference of Convex Algorithm (InmBDCA). This algorithm incorporates two main features of inexactness: First, the subproblem therein is solved approximately, allowing for a controlled relative error tolerance in defining the linesearch direction. Second, an inexact nonmonotone linesearch scheme is used to determine the step size for the next iteration. Under suitable assumptions, we demonstrate that InmBDCA is well-defined, with any accumulation point of the sequence generated by InmBDCA being a critical point of the problem. We also provide iteration-complexity bounds for the algorithm. Numerical experiments show that InmBDCA outperforms both the nonsmooth BDCA (nmBDCA) and the monotone version of DCA in practical scenarios.

URL: https://arxiv.org/abs/2412.05697

39. A preconditioned second-order convex splitting algorithm with a difference of varying convex functions and line search [arXiv:2411.07661] #

Authors: Xinhua Shen, Zaijiu Shang, Hongpeng Sun

Abstract: This paper introduces a preconditioned convex splitting algorithm enhanced with line search techniques for nonconvex optimization problems. The algorithm utilizes second-order backward differentiation formulas (BDF) for the implicit and linear components and the Adams-Bashforth scheme for the nonlinear and explicit parts of the gradient flow in variational functions. The proposed algorithm, resembling a generalized difference-of-convex-function approach, involves a changing set of convex functions in each iteration. It integrates the Armijo line search strategy to improve performance. The study also discusses classical preconditioners such as symmetric Gauss-Seidel, Jacobi, and Richardson within this context. The global convergence of the algorithm is established through the Kurdyka-Łojasiewicz properties, ensuring convergence within a finite number of preconditioned iterations. Numerical experiments demonstrate the superiority of the proposed second-order convex splitting with line search over conventional difference-of-convex-function algorithms.

URL: https://arxiv.org/abs/2411.07661

38. Inertial Proximal Difference-of-Convex Algorithm with Convergent Bregman Plug-and-Play for Nonconvex Imaging [arXiv:2409.03262] #

Authors: Tsz Ching Chow, Chaoyan Huang, Zhongming Wu, Tieyong Zeng, Angelica I. Aviles-Rivero

Abstract: Imaging tasks are typically tackled using a structured optimization framework. This paper delves into a class of algorithms for difference-of-convex (DC) structured optimization, focusing on minimizing a DC function along with a possibly nonconvex function. Existing DC algorithm (DCA) versions often fail to effectively handle nonconvex functions or exhibit slow convergence rates. We propose a novel inertial proximal DC algorithm in Bregman geometry, named iBPDCA, designed to address nonconvex terms and enhance convergence speed through inertial techniques. We provide a detailed theoretical analysis, establishing both subsequential and global convergence of iBPDCA via the Kurdyka-Łojasiewicz property. Additionally, we introduce a Plug-and-Play variant, PnP-iBPDCA, which employs a deep neural network-based prior for greater flexibility and robustness while ensuring theoretical convergence. We also establish that the Gaussian gradient step denoiser used in our method is equivalent to evaluating the Bregman proximal operator for an implicitly weakly convex functional. We extensively validate our method on Rician noise and phase retrieval. We demonstrate that iBPDCA surpasses existing state-of-the-art methods.

URL: https://arxiv.org/abs/2409.03262

37. Constructing Tight Quadratic Relaxations for Global Optimization: II. Underestimating Difference-of-Convex (D.C.) Functions [arXiv:2408.13058] #

Authors: William R. Strahl, Arvind U. Raghunathan, Nikolaos V. Sahinidis, Chrysanthos E. Gounaris

Abstract: Recent advances in the efficiency and robustness of algorithms solving convex quadratically constrained quadratic programming (QCQP) problems motivate developing techniques for creating convex quadratic relaxations that, although more expensive to compute, provide tighter bounds than their classical linear counterparts. In the first part of this two-paper series [Strahl et al., 2024], we developed a cutting plane algorithm to construct convex quadratic underestimators for twice-differentiable convex functions, which we extend here to address the case of non-convex difference-of-convex (d.c.) functions as well. Furthermore, we generalize our approach to consider a hierarchy of quadratic forms, thereby allowing the construction of even tighter underestimators. On a set of d.c. functions extracted from benchmark libraries, we demonstrate noteworthy reduction in the hypervolume between our quadratic underestimators and linear ones constructed at the same points. Additionally, we construct convex QCQP relaxations at the root node of a spatial branch-and-bound tree for a set of systematically created d.c. optimization problems in up to four dimensions, and we show that our relaxations reduce the gap between the lower bound computed by the state-of-the-art global optimization solver BARON and the optimal solution by an excess of 90%, on average.

URL: https://arxiv.org/abs/2408.13058

36. Distributed Difference of Convex Optimization [arXiv:2407.16728] #

Authors: Vivek Khatana, Murti V. Salapaka

Abstract: In this article, we focus on solving a class of distributed optimization problems involving $n$ agents with the local objective function at every agent $i$ given by the difference of two convex functions $f_i$ and $g_i$ (difference-of-convex (DC) form), where $f_i$ and $g_i$ are potentially nonsmooth. The agents communicate via a directed graph containing $n$ nodes. We create smooth approximations of the functions $f_i$ and $g_i$ and develop a distributed algorithm utilizing the gradients of the smooth surrogates and a finite-time approximate consensus protocol. We term this algorithm DDC-Consensus. The developed DDC-Consensus algorithm allows for non-symmetric directed graph topologies and can be synthesized distributively. We establish that the DDC-Consensus algorithm converges to a stationary point of the nonconvex distributed optimization problem. The performance of the DDC-Consensus algorithm is evaluated via a simulation study to solve a nonconvex DC-regularized distributed least squares problem. The numerical results corroborate the efficacy of the proposed algorithm.

URL: https://arxiv.org/abs/2407.16728

35. An Inexact Bregman Proximal Difference-of-Convex Algorithm with Two Types of Relative Stopping Criteria [arXiv:2406.04646] #

Authors: Lei Yang, Jingjing Hu, Kim-Chuan Toh

Abstract: In this paper, we consider a class of difference-of-convex (DC) optimization problems, which require only a weaker restricted $L$-smooth adaptable property on the smooth part of the objective function, instead of the standard global Lipschitz gradient continuity assumption. Such problems are prevalent in many contemporary applications such as compressed sensing, statistical regression, and machine learning, and can be solved by a general Bregman proximal DC algorithm (BPDCA). However, the existing BPDCA is developed based on the stringent requirement that the involved subproblems must be solved exactly, which is often impractical and limits the applicability of the BPDCA. To facilitate the practical implementations and wider applications of the BPDCA, we develop an inexact Bregman proximal difference-of-convex algorithm (iBPDCA) by incorporating two types of relative-type stopping criteria for solving the subproblems. The proposed inexact framework has considerable flexibility to encompass many existing exact and inexact methods, and can accommodate different types of errors that may occur when solving the subproblem. This enables the potential application of our inexact framework across different DC decompositions to facilitate the design of a more efficient DCA scheme in practice. The global subsequential convergence and the global sequential convergence of our iBPDCA are established under suitable conditions including the Kurdyka-Łojasiewicz property. Some numerical experiments are conducted to show the superior performance of our iBPDCA in comparison to existing algorithms. These results also empirically validate the necessity and significance of developing different types of stopping criteria to facilitate the efficient computation of the subproblem in each iteration of our iBPDCA.

URL:

34. Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions [arXiv:2405.18577] #

Authors: Quanqi Hu, Qi Qi, Zhaosong Lu, Tianbao Yang

Abstract: In this paper, we study a class of non-smooth non-convex problems in the form of $\min_{x}[\max_{y\in Y}φ(x, y) - \max_{z\in Z}ψ(x, z)]$, where both $Φ(x) = \max_{y\in Y}φ(x, y)$ and $Ψ(x)=\max_{z\in Z}ψ(x, z)$ are weakly convex functions, and $φ(x, y), ψ(x, z)$ are strongly concave functions in terms of $y$ and $z$, respectively. It covers two families of problems that have been studied but are missing single-loop stochastic algorithms, i.e., difference of weakly convex functions and weakly convex strongly-concave min-max problems. We propose a stochastic Moreau envelope approximate gradient method dubbed SMAG, the first single-loop algorithm for solving these problems, and provide a state-of-the-art non-asymptotic convergence rate. The key idea of the design is to compute an approximate gradient of the Moreau envelopes of $Φ, Ψ$ using only one step of stochastic gradient update of the primal and dual variables. Empirically, we conduct experiments on positive-unlabeled (PU) learning and partial area under ROC curve (pAUC) optimization with an adversarial fairness regularizer to validate the effectiveness of our proposed algorithms.

URL:

33. Improved convergence rates for the Difference-of-Convex algorithm [arXiv:2403.16864] #

Authors: Teodor Rotaru, Panagiotis Patrinos, François Glineur

Abstract: We consider a difference-of-convex formulation where one of the terms is allowed to be hypoconvex (or weakly convex). We first examine the precise behavior of a single iteration of the Difference-of-Convex algorithm (DCA), giving a tight characterization of the objective function decrease. This requires distinguishing between eight distinct parameter regimes. Our proofs are inspired by the performance estimation framework, but are much simplified compared to similar previous work. We then derive sublinear DCA convergence rates towards critical points, distinguishing between cases where at least one of the functions is smooth and where both functions are nonsmooth. We conjecture the tightness of these rates for four parameter regimes, based on strong numerical evidence obtained via performance estimation, as well as the leading constant in the asymptotic sublinear rate for two more regimes.

URL:

32. An Efficient Difference-of-Convex Solver for Privacy Funnel [arXiv:2403.04778] #

Authors: Teng-Hui Huang, Hesham El Gamal

Abstract: We propose an efficient solver for the privacy funnel (PF) method, leveraging its difference-of-convex (DC) structure. The proposed DC separation results in a closed-form update equation, which allows straightforward application to both known and unknown distribution settings. For known distribution case, we prove the convergence (local stationary points) of the proposed non-greedy solver, and empirically show that it outperforms the state-of-the-art approaches in characterizing the privacy-utility trade-off. The insights of our DC approach apply to unknown distribution settings where labeled empirical samples are available instead. Leveraging the insights, our alternating minimization solver satisfies the fundamental Markov relation of PF in contrast to previous variational inference-based solvers. Empirically, we evaluate the proposed solver with MNIST and Fashion-MNIST datasets. Our results show that under a comparable reconstruction quality, an adversary suffers from higher prediction error from clustering our compressed codes than that with the compared methods. Most importantly, our solver is independent to private information in inference phase contrary to the baselines.

URL:

31. Approximation analysis for the minimization problem of difference-of-convex functions with Moreau envelopes [arXiv:2402.13461] #

Authors: Yan Tang, Shiqing Zhang

Abstract: In this work, the minimization problem for difference-of-convex (DC) functions is studied using Moreau envelopes, and a descent method with the Moreau gradient is employed to approximate the numerical solution. The main regularization idea, inspired by Hiriart-Urruty [14] and Moudafi [17], is to regularize the components of the DC problem, adapting the parameters and strategic matrices flexibly to evaluate the whole DC problem. It is shown that the inertial gradient method as well as the classic gradient descent scheme tend towards an approximate stationary point of the original problem.

URL:
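
The Moreau envelope and its gradient, on which the abstract above relies, can be made concrete in one dimension. The following is a minimal sketch, not the authors' scheme: it assumes the simplest convex function $|x|$, whose proximal map is soft-thresholding, and the helper names `prox_abs`, `moreau_env_abs`, and `moreau_grad_abs` are hypothetical.

```python
def prox_abs(x, lam):
    """Proximal map of lam * |.|, i.e. soft-thresholding at level lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope of |.|: min over y of |y| + (x - y)**2 / (2 * lam)."""
    p = prox_abs(x, lam)  # the minimizer y is the proximal point
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


if __name__ == "__main__":
    for x in (-3.0, 0.5, 3.0):
        print(x, moreau_env_abs(x, 1.0), moreau_grad_abs(x, 1.0))
```

For $|x|$ the envelope is the Huber function: quadratic near zero and linear with slope $\pm 1$ farther out, so the nonsmooth kink is replaced by a differentiable surrogate whose gradient is available from a single proximal evaluation.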

30. The Boosted Difference of Convex Functions Algorithm for Value-at-Risk Constrained Portfolio Optimization [arXiv:2402.09194] #

Authors: Marah-Lisanne Thormann, Phan Tu Vuong, Alain B. Zemkoho

Abstract: A highly relevant problem of modern finance is the design of Value-at-Risk (VaR) optimal portfolios. Due to contemporary financial regulations, banks and other financial institutions are tied to use the risk measure to control their credit, market and operational risks. For a portfolio with a discrete return distribution and finitely many scenarios, a Difference of Convex (DC) functions representation of the VaR can be derived. Wozabal (2012) showed that this yields a solution to a VaR constrained Markowitz style portfolio selection problem using the Difference of Convex Functions Algorithm (DCA). A recent algorithmic extension is the so-called Boosted Difference of Convex Functions Algorithm (BDCA) which accelerates the convergence due to an additional line search step. It has been shown that the BDCA converges linearly for solving non-smooth quadratic problems with linear inequality constraints. In this paper, we prove that the linear rate of convergence is also guaranteed for a piecewise linear objective function with linear equality and inequality constraints using the Kurdyka-Łojasiewicz property. An extended case study under consideration of best practices for comparing optimization algorithms demonstrates the superiority of the BDCA over the DCA for real-world financial market data. We are able to show that the results of the BDCA are significantly closer to the efficient frontier compared to the DCA. Due to the open availability of all data sets and code, this paper further provides a practical guide for transparent and easily reproducible comparisons of VaR constrained portfolio selection problems in Python.

URL:
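
The "additional line search step" that distinguishes BDCA from DCA can be illustrated on a toy problem. This is a sketch under stated assumptions, not the paper's portfolio setting: the objective $f(x) = x^4 - 2x^2$ is smooth and unconstrained (the paper treats piecewise linear objectives with constraints), the DC split $g(x) = x^4$, $h(x) = 2x^2$ is chosen for its closed-form DCA step, and the backtracking parameters `lam_bar`, `alpha`, and `beta` are illustrative values.

```python
import math


def f(x):
    """Toy DC objective f = g - h with g(x) = x**4 and h(x) = 2 * x**2."""
    return x**4 - 2.0 * x**2


def dca_step(x):
    """Closed-form DCA step: minimize x**4 - h'(x_k) * x, giving cbrt(x_k)."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def bdca(x0, iters, lam_bar=2.0, alpha=0.1, beta=0.5):
    """DCA step followed by a backtracking line search along d = y - x."""
    x = x0
    for _ in range(iters):
        y = dca_step(x)
        d = y - x
        lam = lam_bar
        # Shrink lam until the sufficient-decrease test relative to f(y) holds.
        while lam > 1e-12 and f(y + lam * d) > f(y) - alpha * lam * lam * d * d:
            lam *= beta
        # Fall back to the plain DCA point if no acceptable step was found.
        x = y + lam * d if lam > 1e-12 else y
    return x


def dca(x0, iters):
    """Plain DCA for comparison."""
    x = x0
    for _ in range(iters):
        x = dca_step(x)
    return x


if __name__ == "__main__":
    print("DCA :", dca(0.1, 2))
    print("BDCA:", bdca(0.1, 2))
```

Both methods approach the minimizer $x = 1$ from the starting point $0.1$. Because the line search only accepts points whose objective value is no worse than that of the DCA point, the boosted iterate never has a higher objective than the DCA step it extends.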

29. A Globally Convergent Algorithm for Neural Network Parameter Optimization Based on Difference-of-Convex Functions [arXiv:2401.07936] #

Authors: Daniel Tschernutter, Mathias Kraus, Stefan Feuerriegel

Abstract: We propose an algorithm for optimizing the parameters of single hidden layer neural networks. Specifically, we derive a blockwise difference-of-convex (DC) functions representation of the objective function. Based on the latter, we propose a block coordinate descent (BCD) approach that we combine with a tailored difference-of-convex functions algorithm (DCA). We prove global convergence of the proposed algorithm. Furthermore, we mathematically analyze the convergence rate of parameters and the convergence rate in value (i.e., the training loss). We give conditions under which our algorithm converges linearly or even faster depending on the local shape of the loss function. We confirm our theoretical derivations numerically and compare our algorithm against state-of-the-art gradient-based solvers in terms of both training loss and test loss.

URL:

28. Higher-order tensor methods for minimizing difference of convex functions [arXiv:2401.05063] #

Authors: Ion Necoara

Abstract: Higher-order tensor methods were recently proposed for minimizing smooth convex and nonconvex functions. Higher-order algorithms accelerate the convergence of the classical first-order methods thanks to the higher-order derivatives used in the updates. The purpose of this paper is twofold. Firstly, to show that the higher-order algorithmic framework can be generalized and successfully applied to (nonsmooth) difference of convex functions, namely, those that can be expressed as the difference of two smooth convex functions and a possibly nonsmooth convex one. We also provide examples when the subproblem can be solved efficiently, even globally. Secondly, to derive a complete convergence analysis for our higher-order difference of convex functions (HO-DC) algorithm. In particular, we prove that any limit point of the HO-DC iterative sequence is a critical point of the problem under consideration, the corresponding objective value is monotonically decreasing and the minimum value of the norms of its subgradients converges globally to zero at a sublinear rate. The sublinear or linear convergence rates of the iterations are obtained under the Kurdyka-Lojasiewicz property.

URL:

27. Handling nonlinearities and uncertainties of fed-batch cultivations with difference of convex functions tube MPC [arXiv:2312.00847] #

Authors: Niels Krausch, Martin Doff-Sotta, Mark Cannon, Peter Neubauer, Mariano Nicolas Cruz Bournazou

Abstract: Bioprocesses are often characterized by nonlinear and uncertain dynamics. This poses particular challenges in the context of model predictive control (MPC). Several approaches have been proposed to solve this problem, such as robust or stochastic MPC, but they can be computationally expensive when the system is nonlinear. Recent advances in optimal control theory have shown that concepts from convex optimization, tube-based MPC, and difference of convex functions (DC) enable stable and robust online process control. The approach is based on systematic DC decompositions of the dynamics and successive linearizations around feasible trajectories. By convexity, the linearization errors can be bounded tightly and treated as bounded disturbances in a robust tube-based MPC framework. However, finding the DC composition can be a difficult task. To overcome this problem, we used a neural network with special convex structure to learn the dynamics in DC form and express the uncertainty sets using simplices to maximize the product formation rate of a cultivation with uncertain substrate concentration in the feed. The results show that this is a promising approach for computationally tractable data-driven robust MPC of bioprocesses.

URL:

26. A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces [arXiv:2310.17610] #

Authors: Jonathan W. Siegel, Stephan Wojtowytsch

Abstract: We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following:

1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow.
2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$.
3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective.
4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: We prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$.

This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(tφ(t)))$ for any function $φ$ which satisfies $\lim_{t\to\infty}φ(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely - an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.

URL:
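
The step from integrability to the $o(1/t)$ rate stated in the abstract above follows from a standard argument. The sketch below assumes only that the excess energy $e(t) := f(x_t) - \inf f$ is nonnegative, nonincreasing (as it is along a gradient flow), and integrable on $[0,\infty)$; the symbol $e$ is introduced here for illustration. Since $e(s) \ge e(t)$ for every $s \le t$,

$$
\frac{t}{2}\, e(t) \;\le\; \int_{t/2}^{t} e(s)\, ds \;\le\; \int_{t/2}^{\infty} e(s)\, ds \;\longrightarrow\; 0 \quad (t\to\infty),
$$

because the tail of a convergent integral vanishes. Hence $t\, e(t) \to 0$, which is exactly $f(x_t) - \inf f = o(1/t)$.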

25. Large Convex sets in Difference sets [arXiv:2309.07527] #

Authors: Krishnendu Bhowmick, Ben Lund, Oliver Roche-Newton

Abstract: We give a construction of a convex set $A \subset \mathbb R$ with cardinality $n$ such that $A-A$ contains a convex subset with cardinality $Ω(n^2)$. We also consider the following variant of this problem: given a convex set $A$, what is the size of the largest matching $M \subset A \times A$ such that the set $\{ a-b : (a,b) \in M \}$ is convex? We prove that there always exists such an $M$ with $|M| \geq \sqrt n$, and that this lower bound is best possible, up to a multiplicative constant.

URL:

24. Moreau Envelope Based Difference-of-weakly-Convex Reformulation and Algorithm for Bilevel Programs [arXiv:2306.16761] #

Authors: Lucy L. Gao, Jane J. Ye, Haian Yin, Shangzhi Zeng, Jin Zhang

Abstract: Bilevel programming has emerged as a valuable tool for hyperparameter selection, a central concern in machine learning. In a recent study by Ye et al. (2023), a value function-based difference of convex algorithm was introduced to address bilevel programs. This approach proves particularly powerful when dealing with scenarios where the lower-level problem exhibits convexity in both the upper-level and lower-level variables. Examples of such scenarios include support vector machines and $\ell_1$ and $\ell_2$ regularized regression. In this paper, we significantly expand the range of applications, now requiring convexity only in the lower-level variables of the lower-level program. We present an innovative single-level difference of weakly convex reformulation based on the Moreau envelope of the lower-level problem. We further develop a sequentially convergent Inexact Proximal Difference of Weakly Convex Algorithm (iP-DwCA). To evaluate the effectiveness of the proposed iP-DwCA, we conduct numerical experiments focused on tuning hyperparameters for kernel support vector machines on simulated data.

URL:

23. Generalized Graph Signal Sampling by Difference-of-Convex Optimization [arXiv:2306.14634] #

Authors: Keitaro Yamashita, Kazuki Naganuma, Shunsuke Ono

Abstract: We propose a method for designing a flexible sampling operator for graph signals via a difference-of-convex (DC) optimization algorithm. A fundamental challenge in graph signal processing is sampling, especially for graph signals that are not bandlimited. In order to sample beyond bandlimited graph signals, there are studies to expand the generalized sampling theory for the graph setting. Vertex-wise sampling and flexible sampling are two main strategies to sample graph signals. Recovery accuracy of existing vertex-wise sampling methods is highly dependent on specific vertices selected to generate a sampled graph signal that may compromise the accuracy especially when noise is generated at the vertices. In contrast, a flexible sampling mixes values at multiple vertices to generate a sampled signal for robust sampling; however, existing flexible sampling methods impose strict assumptions and aggressive relaxations. To address these limitations, we aim to design a flexible sampling operator without such strict assumptions and aggressive relaxations by introducing DC optimization. By formulating the problem of designing a flexible sampling operator as a DC optimization problem, our method ensures robust sampling for graph signals under arbitrary priors based on generalized sampling theory. We develop an efficient solver based on the general double-proximal gradient DC algorithm, which guarantees convergence to a critical point. Experimental results demonstrate the superiority of our method in sampling and recovering beyond bandlimited graph signals compared to existing approaches.

URL:

22. A globally convergent difference-of-convex algorithmic framework and application to log-determinant optimization problems [arXiv:2306.02001] #

Authors: Chaorui Yao, Xin Jiang

Abstract: The difference-of-convex algorithm (DCA) is a conceptually simple method for the minimization of (possibly) nonconvex functions that are expressed as the difference of two convex functions. At each iteration, DCA constructs a global overestimator of the objective and solves the resulting convex subproblem. Despite its conceptual simplicity, the theoretical understanding and algorithmic framework of DCA needs further investigation. In this paper, global convergence of DCA at a linear rate is established under an extended Polyak–Łojasiewicz condition. The proposed condition holds for a class of DC programs with a bounded, closed, and convex constraint set, for which global convergence of DCA cannot be covered by existing analyses. Moreover, the DCProx computational framework is proposed, in which the DCA subproblems are solved by a primal–dual proximal algorithm with Bregman distances. With a suitable choice of Bregman distances, DCProx has simple update rules with cheap per-iteration complexity. As an application, DCA is applied to several fundamental problems in network information theory, for which no existing numerical methods are able to compute the global optimum. For these problems, our analysis proves the global convergence of DCA, and more importantly, DCProx solves the DCA subproblems efficiently. Numerical experiments are conducted to verify the efficiency of DCProx.

URL:
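
The DCA iteration described in the abstract above (linearize the subtracted convex part, then minimize the resulting convex overestimator) can be sketched on a one-dimensional toy problem. This is an illustration under assumptions, not the DCProx framework: the objective $f(x) = x^4 - 2x^2$ and the split $g(x) = x^4$, $h(x) = 2x^2$ are chosen here only because the convex subproblem then has a closed-form solution.

```python
import math


def f(x):
    """Toy DC objective f = g - h with g(x) = x**4 and h(x) = 2 * x**2."""
    return x**4 - 2.0 * x**2


def dca(x0, iters=60):
    """Run DCA on f; returns the final iterate and the objective history."""
    x = x0
    values = [f(x)]
    for _ in range(iters):
        # Linearizing h at x_k gives the convex overestimator
        # g(x) - slope * x + const, with slope = h'(x_k).
        slope = 4.0 * x
        # Its minimizer solves 4 * x**3 = slope, i.e. x = cbrt(slope / 4).
        x = math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)
        values.append(f(x))
    return x, values


if __name__ == "__main__":
    x_final, history = dca(0.1)
    print("final iterate:", x_final)
    print("first objective values:", history[:4])
```

Starting from a positive point the iterates move to the minimizer $x = 1$, and from a negative point to $x = -1$; the objective values never increase along the way. The point $x = 0$ is a critical point of $f$, so an iteration started exactly there stays there.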

21. A property of strictly convex functions which differ from each other by a constant on the boundary of their domain [arXiv:2305.12183] #

Authors: Biagio Ricceri

Abstract: In this paper, in particular, we prove the following result: Let $E$ be a reflexive real Banach space and let $C\subset E$ be a closed convex set, with non-empty interior, whose boundary is sequentially weakly closed and non-convex. Then, for every function $\varphi:\partial C\to {\bf R}$ and for every convex set $S\subseteq E^*$ dense in $E^*$, there exists $\tilde{γ} \in S$ having the following property: for every strictly convex lower semicontinuous function $J:C \to {\bf R}$, Gâteaux differentiable in $\hbox{int}(C)$, such that $J_{\mid\partial C}-\varphi$ is constant in $\partial C$ and $\lim_{\|x\|\to +\infty}{{J(x)}\over {\|x\|}} = +\infty$ if $C$ is unbounded, $\tilde{γ}$ is an algebraically interior point of $J'(\hbox{int}(C))$ (with respect to $E^*$).

URL:

20. Local Differences Determined by Convex sets [arXiv:2304.00888] #

Authors: Krishnendu Bhowmick, Miriam Patry, Oliver Roche-Newton

Abstract: This paper introduces a new problem concerning additive properties of convex sets. Let $S= \{s_1 < \dots <s_n \}$ be a set of real numbers and let $D_i(S)= \{s_x-s_y: 1 \leq x-y \leq i\}$. We expect that $D_i(S)$ is large, with respect to the size of $S$ and the parameter $i$, for any convex set $S$. We give a construction to show that $D_3(S)$ can be as small as $n+2$, and show that this is the smallest possible size. On the other hand, we use an elementary argument to prove a non-trivial lower bound for $D_4(S)$, namely $|D_4(S)| \geq \frac{5}{4}n -1$. For sufficiently large values of $i$, we are able to prove a non-trivial bound that grows with $i$ using incidence geometry.

URL:
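
The definition of $D_i(S)$ in the abstract above is easy to compute directly. The sketch below assumes the usual meaning of a convex set in this literature (a finite set of reals whose consecutive gaps are strictly increasing) and uses the first six squares as an example; it does not reproduce the paper's extremal construction, and the function names are hypothetical.

```python
def is_convex(s):
    """True if the consecutive gaps of the sorted list s strictly increase."""
    gaps = [b - a for a, b in zip(s, s[1:])]
    return all(g1 < g2 for g1, g2 in zip(gaps, gaps[1:]))


def local_differences(s, i):
    """D_i(S) = {s_x - s_y : 1 <= x - y <= i} for a sorted list s."""
    n = len(s)
    return {
        s[x] - s[y]
        for y in range(n)
        for x in range(y + 1, min(y + i, n - 1) + 1)
    }


if __name__ == "__main__":
    squares = [k * k for k in range(6)]  # 0, 1, 4, 9, 16, 25
    print("convex:", is_convex(squares))
    for i in range(1, 5):
        print(i, len(local_differences(squares, i)))
```

For this example with $n = 6$, the sizes of $D_1, \dots, D_4$ are 5, 9, 11, and 12, consistent with the bounds $n + 2$ for $D_3$ and $\frac{5}{4}n - 1$ for $D_4$ quoted in the abstract.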

19. Preconditioned Algorithm for Difference of Convex Functions with applications to Graph Ginzburg-Landau Model [arXiv:2303.14495] #

Authors: Xinhua Shen, Hongpeng Sun, Xuecheng Tai

Abstract: In this work, we propose and study a preconditioned framework with a graphic Ginzburg-Landau functional for image segmentation and data clustering by parallel computing. Solving nonlocal models is usually challenging due to the huge computation burden. For the nonconvex and nonlocal variational functional, we propose several damped Jacobi and generalized Richardson preconditioners for the large-scale linear systems within a difference of convex functions algorithms framework. They are efficient for parallel computing with GPU and can leverage the computational cost. Our framework also provides flexible step sizes with a global convergence guarantee. Numerical experiments show the proposed algorithms are very competitive compared to the singular value decomposition based spectral method.

URL:

18. Multi-UAV trajectory planning problem using the difference of convex function programming [arXiv:2303.07581] #

Authors: Anh Phuong Ngo, Christian Thomas, Ali Karimoddini, Hieu T. Nguyen

Abstract: The trajectory planning problem for a swarm of multiple UAVs is known as a challenging nonconvex optimization problem, particularly due to a large number of collision avoidance constraints required for individual pairs of UAVs in the swarm. In this paper, we tackle this nonconvexity by leveraging the difference of convex function (DC) programming. We introduce the slack variables to relax and reformulate the collision avoidance conditions and employ the penalty function term to equivalently convert the problem into a DC form. Consequently, we construct a penalty DC algorithm in which we sequentially solve a set of convex optimization problems obtained by linearizing the collision avoidance constraint. The algorithm iteratively tightens the safety condition and reduces the objective cost of the planning problem and the additional penalty term. Numerical results demonstrate the effectiveness of the proposed approach in planning a large number of UAVs in congested space.

URL:

17. Approximate Bilevel Difference Convex Programming for Bayesian Risk Markov Decision Processes [arXiv:2301.11415] #

Authors: Yifan Lin, Enlu Zhou

Abstract: We consider infinite-horizon Markov Decision Processes where parameters, such as transition probabilities, are unknown and estimated from data. The popular distributionally robust approach to addressing the parameter uncertainty can sometimes be overly conservative. In this paper, we utilize the recently proposed formulation, Bayesian risk Markov Decision Process (BR-MDP), to address parameter (or epistemic) uncertainty in MDPs. To solve the infinite-horizon BR-MDP with a class of convex risk measures, we propose a computationally efficient approach called approximate bilevel difference convex programming (ABDCP). The optimization is performed offline and produces the optimal policy that is represented as a finite state controller with desirable performance guarantees. We also demonstrate the empirical performance of the BR-MDP formulation and the proposed algorithm.

URL:

16. Single-Crossing Differences in Convex Environments [arXiv:2212.12009] #

Authors: Navin Kartik, SangMok Lee, Daniel Rappoport

Abstract: An agent’s preferences depend on an ordered parameter or type. We characterize the set of utility functions with single-crossing differences (SCD) in convex environments. These include preferences over lotteries, both in expected utility and rank-dependent utility frameworks, and preferences over bundles of goods and over consumption streams. Our notion of SCD does not presume an order on the choice space. This unordered SCD is necessary and sufficient for “interval choice” comparative statics. We present applications to cheap talk, observational learning, and collective choice, showing how convex environments arise in these problems and how SCD/interval choice are useful. Methodologically, our main characterization stems from a result on linear aggregations of single-crossing functions.

URL:

15. Control of Uncertain PWA Systems using Difference-of-Convex Decompositions [arXiv:2209.12990] #

Authors: Siddharth H. Nair, Yvonne R. Stürz

Abstract: In this report, we analyze and design feedback policies for discrete-time Piecewise-Affine (PWA) systems with uncertainty in both the affine dynamics and the polytopic partition. The main idea is to utilise the Difference-of-Convex (DC) decomposition of continuous PWA systems to derive quadratic Lyapunov functions as stability certificates and stabilizing affine policies in a higher dimensional space. When projected back to the state space, we obtain time-varying PWQ Lyapunov functions and time-varying PWA feedback policies.

URL:

14. Encoding inductive invariants as barrier certificates: synthesis via difference-of-convex programming [arXiv:2209.09703] #

Authors: Qiuye Wang, Mingshuai Chen, Bai Xue, Naijun Zhan, Joost-Pieter Katoen

Abstract: A barrier certificate often serves as an inductive invariant that isolates an unsafe region from the reachable set of states, and hence is widely used in proving safety of hybrid systems possibly over an infinite time horizon. We present a novel condition on barrier certificates, termed the invariant barrier-certificate condition, that witnesses unbounded-time safety of differential dynamical systems. The proposed condition is the weakest possible one to attain inductive invariance. We show that discharging the invariant barrier-certificate condition – thereby synthesizing invariant barrier certificates – can be encoded as solving an optimization problem subject to bilinear matrix inequalities (BMIs). We further propose a synthesis algorithm based on difference-of-convex programming, which approaches a local optimum of the BMI problem via solving a series of convex optimization problems. This algorithm is incorporated in a branch-and-bound framework that searches for the global optimum in a divide-and-conquer fashion. We present a weak completeness result of our method, namely, a barrier certificate is guaranteed to be found (under some mild assumptions) whenever there exists an inductive invariant (in the form of a given template) that suffices to certify safety of the system. Experimental results on benchmarks demonstrate the effectiveness and efficiency of our approach.

URL:

13. A convex set with a rich difference [arXiv:2208.03258] #

Authors: Oliver Roche-Newton, Audie Warren

Abstract: We construct a convex set $A$ with cardinality $2n$ and with the property that an element of the difference set $A-A$ can be represented in $n$ different ways. We also show that this construction is optimal by proving that for any convex set $A$, the maximum possible number of representations an element of $A-A$ can have is $\lfloor |A|/2 \rfloor $.

URL:
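
The quantity bounded in the abstract above, the number of ways an element of $A - A$ can be written as a difference, can be counted by brute force. This sketch assumes the usual definition of a convex set (strictly increasing consecutive gaps) and counts only nonzero differences; the example set of squares is an assumption for illustration and is not the paper's optimal construction.

```python
from collections import Counter


def is_convex(a):
    """True if the consecutive gaps of the sorted list a strictly increase."""
    gaps = [y - x for x, y in zip(a, a[1:])]
    return all(g1 < g2 for g1, g2 in zip(gaps, gaps[1:]))


def max_representations(a):
    """Largest number of pairs (x, y) in a with x > y sharing one difference x - y."""
    counts = Counter(x - y for x in a for y in a if x > y)
    return max(counts.values())


if __name__ == "__main__":
    squares = [k * k for k in range(6)]  # 0, 1, 4, 9, 16, 25
    print("convex:", is_convex(squares))
    print("max representations:", max_representations(squares))
    print("upper bound floor(|A|/2):", len(squares) // 2)
```

For the six squares the most popular differences (9 and 16) each have two representations, below the bound $\lfloor |A|/2 \rfloor = 3$, so this particular set does not attain the maximum.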

12. Value Function Based Difference-of-Convex Algorithm for Bilevel Hyperparameter Selection Problems [arXiv:2206.05976] #

Authors: Lucy Gao, Jane J. Ye, Haian Yin, Shangzhi Zeng, Jin Zhang

Abstract: Gradient-based optimization methods for hyperparameter tuning guarantee theoretical convergence to stationary solutions when for fixed upper-level variable values, the lower level of the bilevel program is strongly convex (LLSC) and smooth (LLS). This condition is not satisfied for bilevel programs arising from tuning hyperparameters in many machine learning algorithms. In this work, we develop a sequentially convergent Value Function based Difference-of-Convex Algorithm with inexactness (VF-iDCA). We show that this algorithm achieves stationary solutions without LLSC and LLS assumptions for bilevel programs from a broad class of hyperparameter tuning applications. Our extensive experiments confirm our theoretical findings and show that the proposed VF-iDCA yields superior performance when applied to tune hyperparameters.

URL:

11. Decentralized Saddle-Point Problems with Different Constants of Strong Convexity and Strong Concavity [arXiv:2206.00090] #

Authors: Dmitriy Metelev, Alexander Rogozin, Alexander Gasnikov, Dmitry Kovalev

Abstract: Large-scale saddle-point problems arise in such machine learning tasks as GANs and linear models with affine constraints. In this paper, we study distributed saddle-point problems (SPP) with strongly-convex-strongly-concave smooth objectives that have different strong convexity and strong concavity parameters of composite terms, which correspond to min and max variables, and bilinear saddle-point part. We consider two types of first-order oracles: deterministic (returns gradient) and stochastic (returns unbiased stochastic gradient). Our method works in both cases and takes several consensus steps between oracle calls.

URL:

10. The difference of convex algorithm on Hadamard manifolds [arXiv:2112.05250] #

Authors: Ronny Bergmann, Orizon P. Ferreira, Elianderson M. Santos, João Carlos O. Souza

Abstract: In this paper, we propose a Riemannian version of the difference of convex algorithm (DCA) to solve a minimization problem involving the difference of convex (DC) function. We establish the equivalence between the classical and simplified Riemannian versions of the DCA. We also prove that, under mild assumptions, the Riemannian version of the DCA is well-defined, and every cluster point of the sequence generated by the proposed method, if any, is a critical point of the objective DC function. Additionally, we establish some duality relations between the DC problem and its dual. To illustrate the effectiveness of the algorithm, we present some numerical experiments.

URL:

9. Data Fitting with Signomial Programming Compatible Difference of Convex Functions [arXiv:2110.12104] #

Authors: Cody Karcher

Abstract: Signomial Programming (SP) has proven to be a powerful tool for engineering design optimization, striking a balance between the computational efficiency of Geometric Programming (GP) and the extensibility of more general optimization methods like Sequential Quadratic Programming (SQP). But when an existing engineering analysis tool is incompatible with the mathematics of the SP formulation, options are limited. Previous literature has suggested schemes for fitting GP compatible models to pre-computed data, but no methods have yet been proposed that take advantage of the increased modeling flexibility available in SP. This paper describes a new Soft Difference of Max Affine (SDMA) function class that is constructed from existing methods of GP compatible fitting and the theory of Difference of Convex (DC) functions. When a SDMA function is fit to data in log-log transformed space, it becomes either a signomial or a set of signomials upon inverse transformation. Three examples of fitting are presented here, including simple test cases in 2D and 3D, and a fit to the performance data of the NACA 24xx family of airfoils. In each case, RMS error is driven to less than 1%.

URL:

8. Factored couplings in multi-marginal optimal transport via difference of convex programming [arXiv:2110.00629] #

Authors: Quang Huy Tran, Hicham Janati, Ievgen Redko, Rémi Flamary, Nicolas Courty

Abstract: Optimal transport (OT) theory underlies many emerging machine learning (ML) methods nowadays solving a wide range of tasks such as generative modeling, transfer learning and information retrieval. These latter works, however, usually build upon a traditional OT setup with two distributions, while leaving a more general multi-marginal OT formulation somewhat unexplored. In this paper, we study the multi-marginal OT (MMOT) problem and unify several popular OT methods under its umbrella by promoting structural information on the coupling. We show that incorporating such structural information into MMOT results in an instance of a difference of convex (DC) programming problem allowing us to solve it numerically. Despite high computational cost of the latter procedure, the solutions provided by DC optimization are usually as qualitative as those obtained using currently employed optimization schemes.

URL:

7. On the rate of convergence of the Difference-of-Convex Algorithm (DCA) [arXiv:2109.13566] #

Authors: Hadi Abbaszadehpeivasti, Etienne de Klerk, Moslem Zamani

Abstract: In this paper, we study the convergence rate of the DCA (Difference-of-Convex Algorithm), also known as the convex-concave procedure, with two different termination criteria that are suitable for smooth and nonsmooth decompositions respectively. The DCA is a popular algorithm for difference-of-convex (DC) problems, and known to converge to a stationary point of the objective under some assumptions. We derive a worst-case convergence rate of $O(1/\sqrt{N})$ after $N$ iterations of the objective gradient norm for certain classes of DC problems, without assuming strong convexity in the DC decomposition, and give an example which shows the convergence rate is exact. We also provide a new convergence rate of $O(1/N)$ for the DCA with the second termination criterion. Moreover, we derive a new linear convergence rate result for the DCA under the assumption of the Polyak-Łojasiewicz inequality. The novel aspect of our analysis is that it employs semidefinite programming performance estimation.

URL: https://arxiv.org/abs/2109.13566
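To make the DCA iteration discussed in this abstract concrete, here is a minimal sketch on a toy problem. The double-well objective $f(x) = x^4/4 - x^2/2$ and its split into $g(x) = x^4/4$ and $h(x) = x^2/2$ are illustrative assumptions, not taken from the paper. Each DCA step linearizes $h$ at the current iterate and minimizes the resulting convex model, which here has a closed form.

```python
import math

def g(x):
    """Convex part of the hypothetical decomposition."""
    return x ** 4 / 4.0

def h(x):
    """Convex part that is subtracted: f = g - h."""
    return x ** 2 / 2.0

def f(x):
    return g(x) - h(x)

def dca(x0, iters=60):
    """Run DCA from x0 and return the list of iterates."""
    xs = [x0]
    for _ in range(iters):
        slope = xs[-1]  # h'(x_k) = x_k
        # Convex subproblem: minimize x^4/4 - slope * x, i.e. solve x^3 = slope.
        x_next = math.copysign(abs(slope) ** (1.0 / 3.0), slope)
        xs.append(x_next)
    return xs

iterates = dca(0.1)
values = [f(x) for x in iterates]
```

On this toy problem the iterates approach the stationary point $x = 1$ (or $x = -1$ from a negative start), and the objective values never increase, which is the standard descent property of DCA.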

6. A Different Perspective On The Stochastic Convex Feasibility Problem [arXiv:2108.12029] #

Authors: James Renegar, Song Zhou

Abstract: We analyze a simple randomized subgradient method for approximating solutions to stochastic systems of convex functional constraints, the only input to the algorithm being the size of minibatches. By introducing a new notion of what is meant for a point to approximately solve the constraints, determining bounds on the expected number of iterations reduces to determining a hitting time for a compound Bernoulli process, elementary probability. Besides bounding the expected number of iterations quite generally, we easily establish concentration inequalities on the number of iterations, and more interesting, we establish much-improved bounds when a notion akin to Hölderian growth is satisfied, for all degrees of growth, not just the linear growth of piecewise-linear convex functions or the quadratic growth of strongly convex functions. Finally, we establish the analogous results under a slight modification to the algorithm which results in the user knowing with high confidence an iterate is in hand that approximately solves the system. Perhaps surprisingly, the iteration bounds here are deterministic – all of the probability gets wrapped into the confidence level (albeit at the expense of potentially large minibatches).

URL: https://arxiv.org/abs/2108.12029

5. Retraction-based first-order feasible methods for difference-of-convex programs with smooth inequality and simple geometric constraints [arXiv:2106.08584] #

Authors: Yongle Zhang, Guoyin Li, Ting Kei Pong, Shiqi Xu

Abstract: In this paper, we propose first-order feasible methods for difference-of-convex (DC) programs with smooth inequality and simple geometric constraints. Our strategy for maintaining feasibility of the iterates is based on a “retraction” idea adapted from the literature of manifold optimization. When the constraints are convex, we establish the global subsequential convergence of the sequence generated by our algorithm under strict feasibility condition, and analyze its convergence rate when the objective is in addition convex according to the Kurdyka-Lojasiewicz (KL) exponent of the extended objective (i.e., sum of the objective and the indicator function of the constraint set). We also show that the extended objective of a large class of Euclidean norm (and more generally, group LASSO penalty) regularized convex optimization problems is a KL function with exponent $\frac12$; consequently, our algorithm is locally linearly convergent when applied to these problems. We then extend our method to solve DC programs with a single specially structured nonconvex constraint. Finally, we discuss how our algorithms can be applied to solve two concrete optimization problems, namely, group-structured compressed sensing problems with Gaussian measurement noise and compressed sensing problems with Cauchy measurement noise, and illustrate the empirical performance of our algorithms.

URL: https://arxiv.org/abs/2106.08584

4. Synthesizing Invariant Barrier Certificates via Difference-of-Convex Programming [arXiv:2105.14311] #

Authors: Qiuye Wang, Mingshuai Chen, Bai Xue, Naijun Zhan, Joost-Pieter Katoen

Abstract: A barrier certificate often serves as an inductive invariant that isolates an unsafe region from the reachable set of states, and hence is widely used in proving safety of hybrid systems possibly over the infinite time horizon. We present a novel condition on barrier certificates, termed the invariant barrier-certificate condition, that witnesses unbounded-time safety of differential dynamical systems. The proposed condition is by far the least conservative one on barrier certificates, and can be shown as the weakest possible one to attain inductive invariance. We show that discharging the invariant barrier-certificate condition – thereby synthesizing invariant barrier certificates – can be encoded as solving an optimization problem subject to bilinear matrix inequalities (BMIs). We further propose a synthesis algorithm based on difference-of-convex programming, which approaches a local optimum of the BMI problem via solving a series of convex optimization problems. This algorithm is incorporated in a branch-and-bound framework that searches for the global optimum in a divide-and-conquer fashion. We present a weak completeness result of our method, in the sense that a barrier certificate is guaranteed to be found (under some mild assumptions) whenever there exists an inductive invariant (in the form of a given template) that suffices to certify safety of the system. Experimental results on benchmark examples demonstrate the effectiveness and efficiency of our approach.

URL: https://arxiv.org/abs/2105.14311

3. Algorithms for Difference-of-Convex (DC) Programs Based on Difference-of-Moreau-Envelopes Smoothing [arXiv:2104.01470] #

Authors: Kaizhao Sun, Xu Andy Sun

Abstract: In this paper we consider minimization of a difference-of-convex (DC) function with and without linear constraints. We first study a smooth approximation of a generic DC function, termed difference-of-Moreau-envelopes (DME) smoothing, where both components of the DC function are replaced by their respective Moreau envelopes. The resulting smooth approximation is shown to be Lipschitz differentiable, capture stationary points, local, and global minima of the original DC function, and enjoy some growth conditions, such as level-boundedness and coercivity, for broad classes of DC functions. We then develop four algorithms for solving DC programs with and without linear constraints based on the DME smoothing. In particular, for a smoothed DC program without linear constraints, we show that the classic gradient descent method as well as an inexact variant can obtain a stationary solution in the limit with a convergence rate of $\mathcal{O}(K^{-1/2})$, where $K$ is the number of proximal evaluations of both components. Furthermore, when the DC program is explicitly constrained in an affine subspace, we combine the smoothing technique with the augmented Lagrangian function and derive two variants of the augmented Lagrangian method (ALM), named LCDC-ALM and composite LCDC-ALM, focusing on different structures of the DC objective function. We show that both algorithms find an $ε$-approximate stationary solution of the original DC program in $\mathcal{O}(ε^{-2})$ iterations. Comparing to existing methods designed for linearly constrained weakly convex minimization, the proposed ALM-based algorithms can be applied to a broader class of problems, where the objective contains a nonsmooth concave component. Finally, numerical experiments are presented to demonstrate the performance of the proposed algorithms.

URL: https://arxiv.org/abs/2104.01470
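The smoothing idea in this abstract can be illustrated on a one-dimensional example. The sketch below assumes the hypothetical DC function $f(x) = |x| - |x - 1|$ (not from the paper) and uses the standard definition of the Moreau envelope, $M_\mu g(x) = \min_y \{ g(y) + \frac{1}{2\mu}(x - y)^2 \}$. For $g = |\cdot|$ the minimizer is the soft-thresholding operator and the envelope is the Huber function.

```python
def soft_threshold(x, mu):
    """Proximal operator of mu * |.| evaluated at x."""
    if x > mu:
        return x - mu
    if x < -mu:
        return x + mu
    return 0.0

def moreau_env_abs(x, mu):
    """Moreau envelope of |.| computed through its proximal point."""
    p = soft_threshold(x, mu)
    return abs(p) + (x - p) ** 2 / (2.0 * mu)

def moreau_grad_abs(x, mu):
    """Gradient of the envelope: (x - prox(x)) / mu."""
    return (x - soft_threshold(x, mu)) / mu

def huber(x, mu):
    """Closed form that the envelope of |.| should match."""
    return x * x / (2.0 * mu) if abs(x) <= mu else abs(x) - mu / 2.0

def dme_smooth(x, mu):
    """Difference of Moreau envelopes for f(x) = |x| - |x - 1| (toy example)."""
    return moreau_env_abs(x, mu) - moreau_env_abs(x - 1.0, mu)
```

Each envelope lies within $\mu/2$ below its original function, so the smoothed difference stays within $\mu/2$ of $f$ while being differentiable everywhere.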

2. CDiNN - Convex Difference Neural Networks [arXiv:2103.17231] #

Authors: Parameswaran Sankaranarayanan, Raghunathan Rengaswamy

Abstract: Neural networks with ReLU activation function have been shown to be universal function approximators and learn function mapping as non-smooth functions. Recently, there is considerable interest in the use of neural networks in applications such as optimal control. It is well-known that optimization involving non-convex, non-smooth functions is computationally intensive and has limited convergence guarantees. Moreover, the choice of optimization hyper-parameters used in gradient descent/ascent significantly affects the quality of the obtained solutions. A new neural network architecture called Input Convex Neural Networks (ICNNs) learns the output as a convex function of inputs thereby allowing the use of efficient convex optimization methods. Use of ICNNs for determining the input for minimizing output has two major problems: learning of a non-convex function as a convex mapping could result in significant function approximation error, and we also note that the existing representations cannot capture simple dynamic structures like linear time delay systems. We attempt to address the above problems by introduction of a new neural network architecture, which we call the CDiNN, which learns the function as a difference of polyhedral convex functions from data. We also discuss that, in some cases, the optimal input can be obtained from CDiNN through difference of convex optimization with convergence guarantees and that at each iteration, the problem is reduced to a linear programming problem.

URL: https://arxiv.org/abs/2103.17231
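A "difference of polyhedral convex functions" can be pictured as the difference of two max-affine functions. The sketch below is a hand-built toy, not the CDiNN architecture: the pieces in `G_PIECES` and `H_PIECES` are hypothetical and were chosen so that the difference is a non-convex "hat" on $[0, 1]$, which no single convex function can represent.

```python
def max_affine(pieces, x):
    """Polyhedral convex function: pointwise maximum of affine pieces (a, b) -> a*x + b."""
    return max(a * x + b for a, b in pieces)

# Hypothetical pieces: g(x) = max(0, x, 2x - 1) and h(x) = max(0, 2x - 1).
G_PIECES = [(0.0, 0.0), (1.0, 0.0), (2.0, -1.0)]
H_PIECES = [(0.0, 0.0), (2.0, -1.0)]

def dc_function(x):
    """Difference of two polyhedral convex functions; equals a hat peaking at x = 0.5."""
    return max_affine(G_PIECES, x) - max_affine(H_PIECES, x)
```

Because both parts are maxima of affine functions, linearizing the subtracted part leaves a piecewise-linear convex subproblem, which is consistent with the abstract's remark that each iteration reduces to a linear program.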

1. A Difference-of-Convex Cutting Plane Algorithm for Mixed-Binary Linear Program [arXiv:2103.00717] #

Authors: Yi-Shuai Niu, Yu You

Abstract: In this paper, we propose a cutting plane algorithm based on DC (Difference-of-Convex) programming and DC cut for globally solving Mixed-Binary Linear Program (MBLP). We first use a classical DC programming formulation via the exact penalization to formulate MBLP as a DC program, which can be solved by DCA algorithm. Then, we focus on the construction of DC cuts, which serves either as a local cut (namely type-I DC cut) at feasible local minimizer of MBLP, or as a global cut (namely type-II DC cut) at infeasible local minimizer of MBLP if some particular assumptions are verified. Otherwise, the constructibility of DC cut is still unclear, and we propose to use classical global cuts (such as the Lift-and-Project cut) instead. Combining DC cut and classical global cuts, a cutting plane algorithm, namely DCCUT, is established for globally solving MBLP. The convergence theorem of DCCUT is proved. Restarting DCA in DCCUT helps to quickly update the upper bound solution and to introduce more DC cuts for lower bound improvement. A variant of DCCUT by introducing more classical global cuts in each iteration is proposed, and parallel versions of DCCUT and its variant are also designed which use the power of multiple processors for better performance. Numerical simulations of DCCUT type algorithms comparing with the classical cutting plane algorithm using Lift-and-Project cuts are reported. Tests on some specific samples and the MIPLIB 2017 benchmark dataset demonstrate the benefits of DC cut and good performance of DCCUT algorithms.

URL: https://arxiv.org/abs/2103.00717

Recent Advances in Research on Difference-of-Convex (DC) Programming

Thu, 27 Jun 2024 23:14:15 +0800

Second-order Stochastic Optimization methods for Machine Learning

Thu, 27 Jun 2024 23:14:15 +0800

Analysis of the Hessian #

1. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks #

Year: 2017
Authors: Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, Leon Bottou
ArXiv ID: arXiv:1706.04454
URL: https://arxiv.org/abs/1706.04454

Abstract: We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, (2) and outliers away from the bulk. We present numerical evidence and mathematical justifications to the following conjectures laid out by Sagun et al. (2016): Fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading. And that the discussion of wide/narrow basins may be in need of a new perspective around over-parametrization and redundancy that are able to create large connected components at the bottom of the landscape. Second, the dependence of small number of large eigenvalues to the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light into the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but we show that they are in fact connected through their flat region and so belong to the same basin.

Source Code: No explicit source code information found

2. The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size #

Year: 2018
Authors: Vardan Papyan
ArXiv ID: arXiv:1811.07062
URL: https://arxiv.org/abs/1811.07062

Abstract: We apply state-of-the-art tools in modern high-dimensional numerical linear algebra to approximate efficiently the spectrum of the Hessian of modern deepnets, with tens of millions of parameters, trained on real data. Our results corroborate previous findings, based on small-scale networks, that the Hessian exhibits “spiked” behavior, with several outliers isolated from a continuous bulk. We decompose the Hessian into different components and study the dynamics with training and sample size of each term individually.

Source Code: No explicit source code information found

3. PyHessian: Neural Networks Through the Lens of the Hessian #

Year: 2019
Authors: Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney
ArXiv ID: arXiv:1912.07145
URL: https://arxiv.org/abs/1912.07145

Abstract: We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the loss landscape smoother, thus making it easier for Stochastic Gradient Descent to converge to a good solution. Our extensive analysis shows new finer-scale insights, demonstrating that, while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that Batch Normalization does not necessarily make the loss landscape smoother, especially for shallower networks.

Source Code: Mentions ‘available’ in abstract; Mentions ‘open source’ in abstract; Known repository: https://github.com/amirgholami/PyHessian
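The quantities named in this abstract (top eigenvalue, trace) can be estimated from Hessian-vector products alone. The sketch below does not use the PyHessian API; it is a plain-Python illustration on a small explicit matrix `H` (a hypothetical stand-in for a network Hessian), using power iteration for the top eigenvalue and Hutchinson's randomized estimator for the trace.

```python
import math
import random

# Hypothetical 3x3 symmetric "Hessian"; in practice only products H @ v are available.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hvp(v):
    """Hessian-vector product for the toy matrix above."""
    return [dot(row, v) for row in H]

def top_eigenvalue(matvec, dim, iters=200, seed=0):
    """Power iteration: repeatedly apply the operator and renormalize."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        w = matvec(v)
        norm = math.sqrt(dot(w, w))
        v = [w_i / norm for w_i in w]
    return dot(v, matvec(v))  # Rayleigh quotient of the unit vector v

def hutchinson_trace(matvec, dim, samples=2000, seed=0):
    """Hutchinson estimator: E[z^T H z] = trace(H) for Rademacher z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        total += dot(z, matvec(z))
    return total / samples
```

For this matrix the exact values are $3 + \sqrt{3}$ for the largest eigenvalue and $9$ for the trace, so the two estimates can be checked directly.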

4. A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization #

Year: 2020
Authors: Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, Vineeth N Balasubramanian
ArXiv ID: arXiv:2012.03801
URL: https://arxiv.org/abs/2012.03801

Abstract: Loss landscape analysis is extremely useful for a deeper understanding of the generalization ability of deep neural network models. In this work, we propose a layerwise loss landscape analysis where the loss surface at every layer is studied independently and also on how each correlates to the overall loss surface. We study the layerwise loss landscape by studying the eigenspectra of the Hessian at each layer. In particular, our results show that the layerwise Hessian geometry is largely similar to the entire Hessian. We also report an interesting phenomenon where the Hessian eigenspectrum of middle layers of the deep neural network is observed to be most similar to the overall Hessian eigenspectrum. We also show that the maximum eigenvalue and the trace of the Hessian (both full network and layerwise) reduce as training of the network progresses. We leverage on these observations to propose a new regularizer based on the trace of the layerwise Hessian. Penalizing the trace of the Hessian at every layer indirectly forces Stochastic Gradient Descent to converge to flatter minima, which are shown to have better generalization performance. In particular, we show that such a layerwise regularizer can be leveraged to penalize the middlemost layers alone, which yields promising results. Our empirical studies on well-known deep nets across datasets support the claims of this work.

Source Code: No explicit source code information found

Diagonal Scaling #

1. AdaHessian: An Adaptive Second Order Optimizer for Machine Learning #

Year: 2020
Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
ArXiv ID: arXiv:2006.00719
Algorithm: AdaHessian
URL: https://arxiv.org/abs/2006.00719

Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.

Source Code: Known repository: https://github.com/amirgholami/adahessian
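Item (i) of this abstract refers to a Hutchinson-based estimate of the Hessian diagonal. A minimal sketch of that estimator follows, using the identity $\operatorname{diag}(H) = \mathbb{E}[z \odot (Hz)]$ for Rademacher vectors $z$. It is not the AdaHessian implementation: the explicit matrix and the function names are assumptions for illustration, and the moving-average and block-averaging steps are omitted.

```python
import random

def matvec(H, v):
    """Product H @ v for a dense matrix stored as nested lists."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]

def hutchinson_diagonal(H, samples=4000, seed=0):
    """Estimate diag(H) as the average of z * (H z) over random sign vectors z."""
    rng = random.Random(seed)
    n = len(H)
    acc = [0.0] * n
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Hz = matvec(H, z)
        for i in range(n):
            acc[i] += z[i] * Hz[i]
    return [a / samples for a in acc]

# Hypothetical test matrices.
H_DENSE = [[2.0, 0.5], [0.5, 1.0]]
H_DIAG = [[3.0, 0.0], [0.0, -1.0]]
```

For a diagonal matrix a single sample is already exact, since $z_i^2 = 1$. Off-diagonal entries only add zero-mean noise that averages out as the number of samples grows.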

2. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training #

Year: 2023
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
ArXiv ID: arXiv:2305.14342
Algorithm: Sophia
URL: https://arxiv.org/abs/2305.14342

Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.

Source Code: Known repository: https://github.com/Liuhong99/Sophia
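The update rule described in this abstract (momentum divided by a moving average of the diagonal Hessian estimate, then clipped element-wise) can be sketched as follows. This is a simplified illustration rather than the released Sophia code: the function name, the default hyperparameters, and the way stale Hessian estimates are reused are all assumptions, and details such as weight decay are left out.

```python
def sophia_like_step(theta, grad, m, h, hess_diag=None,
                     lr=0.1, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One clipped, diagonally preconditioned step (illustrative sketch).

    theta, grad, m, h are lists of equal length. hess_diag is a fresh diagonal
    Hessian estimate, or None on steps where the Hessian is not re-estimated.
    """
    new_theta, new_m, new_h = [], [], []
    for i in range(len(theta)):
        m_i = beta1 * m[i] + (1.0 - beta1) * grad[i]  # moving average of gradients
        h_i = h[i]
        if hess_diag is not None:                     # refresh only when available
            h_i = beta2 * h[i] + (1.0 - beta2) * hess_diag[i]
        raw = m_i / max(h_i, eps)                     # preconditioned update
        step = max(-rho, min(rho, raw))               # element-wise clipping
        new_theta.append(theta[i] - lr * step)
        new_m.append(m_i)
        new_h.append(h_i)
    return new_theta, new_m, new_h
```

When the curvature estimate is tiny or negative, the ratio is large and the clip takes over, so the coordinate moves by at most `lr * rho`. This is the mechanism the abstract credits with controlling the worst-case update size.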

Hessian-free Optimization #

1. Learning Recurrent Neural Networks with Hessian-Free Optimization #

Year: 2011
Authors: James Martens, Ilya Sutskever
ArXiv ID:
URL: https://www.cs.toronto.edu/~jmartens/docs/RNN_HF.pdf

Abstract: In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, on three natural and highly complex real-world sequence datasets where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.

Source Code: No explicit source code information found

2. Training Neural Networks with Stochastic Hessian-Free Optimization #

Year: 2013
Authors: Ryan Kiros
ArXiv ID: arXiv:1301.3641
Algorithm: SHF
URL: https://arxiv.org/abs/1301.3641

Abstract: Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens’ HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.

Source Code: No explicit source code information found
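The abstract notes that HF builds update directions with conjugate gradient (CG) using only curvature-vector products. A minimal matrix-free CG sketch follows. The damped system $(B + \lambda I)d = -g$ and the names `hf_direction` and `curvature_matvec` are illustrative assumptions, and refinements used in practice (preconditioning, early termination heuristics) are omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x with x = 0
    p = list(b)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if math.sqrt(rs_old) < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [r_i + (rs_new / rs_old) * p_i for r_i, p_i in zip(r, p)]
        rs_old = rs_new
    return x

def hf_direction(grad, curvature_matvec, damping):
    """Approximately solve (B + damping * I) d = -grad without forming B."""
    def damped(v):
        Bv = curvature_matvec(v)
        return [bv_i + damping * v_i for bv_i, v_i in zip(Bv, v)]
    return conjugate_gradient(damped, [-g_i for g_i in grad])

# Hypothetical 2x2 curvature matrix, used only through products.
B = [[4.0, 1.0], [1.0, 3.0]]

def curvature_matvec(v):
    return [dot(row, v) for row in B]
```

Since CG touches the curvature matrix only through `matvec`, the cost per iteration is one curvature-vector product, which matches the abstract's point that these products cost about as much as a gradient.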

Quasi-Newton #

1. A Stochastic Quasi-Newton Method for Large-Scale Optimization #

Year: 2014
Authors: R.H. Byrd, S.L. Hansen, J. Nocedal, Y. Singer
ArXiv ID: arXiv:1401.7020
URL: https://arxiv.org/abs/1401.7020

Abstract: The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.

Source Code: No explicit source code information found
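The limited-memory BFGS update mentioned in this abstract is usually applied through the standard two-loop recursion, sketched below. The curvature pairs in the example are formed as $y = As$ from a hypothetical matrix `A`, mimicking the abstract's idea of obtaining $y$ from a Hessian-vector product instead of a gradient difference. Everything else, including the function names, is an illustrative assumption and not the paper's code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_apply(grad, pairs):
    """Return H_k @ grad, where H_k is the L-BFGS inverse-Hessian approximation.

    pairs is a list of (s, y) tuples ordered from oldest to newest.
    The quasi-Newton search direction is the negative of the result.
    """
    q = list(grad)
    history = []
    for s, y in reversed(pairs):                 # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        history.append((alpha, rho))
        q = [q_i - alpha * y_i for q_i, y_i in zip(q, y)]
    if pairs:                                    # common initial scaling gamma * I
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * q_i for q_i in q]
    for (s, y), (alpha, rho) in zip(pairs, reversed(history)):  # oldest to newest
        beta = rho * dot(y, r)
        r = [r_i + s_i * (alpha - beta) for r_i, s_i in zip(r, s)]
    return r

# Hypothetical curvature pairs with y = A s for A = [[3, 1], [1, 2]].
A = [[3.0, 1.0], [1.0, 2.0]]
S = [[1.0, 0.0], [0.0, 1.0]]
PAIRS = [(s, [dot(row, s) for row in A]) for s in S]
```

A quick sanity check is the secant condition: applying the approximation to the newest $y$ must return the newest $s$.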

2. A Multi-Batch L-BFGS Method for Machine Learning #

Year: 2016
Authors: Albert S. Berahas, Jorge Nocedal, Martin Takáč
ArXiv ID: arXiv:1605.06049
URL: https://arxiv.org/abs/1605.06049

Abstract: The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

Source Code: No explicit source code information found

3. Stochastic Quasi-Newton with Line-Search Regularization #

Year: 2019
Authors: Adrian Wills, Thomas Schön
ArXiv ID: arXiv:1909.01238
Algorithm: SQN
URL: https://arxiv.org/abs/1909.01238

Abstract: In this paper we present a novel quasi-Newton algorithm for use in stochastic optimisation. Quasi-Newton methods have had an enormous impact on deterministic optimisation problems because they afford rapid convergence and computationally attractive algorithms. In essence, this is achieved by learning the second-order (Hessian) information based on observing first-order gradients. We extend these ideas to the stochastic setting by employing a highly flexible model for the Hessian and infer its value based on observing noisy gradients. In addition, we propose a stochastic counterpart to standard line-search procedures and demonstrate the utility of this combination on maximum likelihood identification for general nonlinear state space models.

Source Code: No explicit source code information found

4. Practical Quasi-Newton Methods for Training Deep Neural Networks #

Year: 2020
Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou
ArXiv ID: arXiv:2006.08877
URL: https://arxiv.org/abs/2006.08877

Abstract: We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements. Consequently, computing and storing a full $n \times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.

Source Code: Mentions ‘implementation’ in abstract

Gauss-Newton #

1. Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks #

Year: 2019
Authors: Yi Ren, Donald Goldfarb
ArXiv ID: arXiv:1906.02353
Algorithm: SWM-GN, SWM-NG
URL: https://arxiv.org/abs/1906.02353

Abstract: We present practical Levenberg-Marquardt variants of Gauss-Newton and natural gradient methods for solving non-convex optimization problems that arise in training deep neural networks involving enormous numbers of variables and huge data sets. Our methods use subsampled Gauss-Newton or Fisher information matrices and either subsampled gradient estimates (fully stochastic) or full gradients (semi-stochastic), which, in the latter case, we prove convergent to a stationary point. By using the Sherman-Morrison-Woodbury formula with automatic differentiation (backpropagation) we show how our methods can be implemented to perform efficiently. Finally, numerical results are presented to demonstrate the effectiveness of our proposed methods.

Source Code: No explicit source code information found
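The role of the Sherman-Morrison-Woodbury formula in this abstract can be illustrated with a small example. With a subsampled Jacobian $J \in \mathbb{R}^{m \times n}$ and $m \ll n$, the damped system satisfies $(\lambda I_n + J^\top J)^{-1} g = \frac{1}{\lambda}\left(g - J^\top (\lambda I_m + J J^\top)^{-1} J g\right)$, so only an $m \times m$ system has to be solved. The sketch below is a generic illustration of that identity; the helper names and the toy data are assumptions, not the paper's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Dense linear solve by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def damped_solve_woodbury(J, g, lam):
    """Compute (lam*I + J^T J)^{-1} g by solving only an m x m system."""
    m, n = len(J), len(g)
    Jg = [dot(row, g) for row in J]
    small = [[dot(J[i], J[k]) + (lam if i == k else 0.0) for k in range(m)]
             for i in range(m)]
    w = solve(small, Jg)
    Jt_w = [sum(J[i][j] * w[i] for i in range(m)) for j in range(n)]
    return [(g_j - c_j) / lam for g_j, c_j in zip(g, Jt_w)]

def damped_solve_direct(J, g, lam):
    """Reference: form the full n x n matrix and solve it directly."""
    m, n = len(J), len(g)
    full = [[sum(J[i][j] * J[i][k] for i in range(m)) + (lam if j == k else 0.0)
             for k in range(n)] for j in range(n)]
    return solve(full, g)

# Hypothetical 2 x 4 Jacobian and gradient.
J = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 1.0]]
g = [1.0, -1.0, 2.0, 0.5]
```

The Levenberg-Marquardt step would be the negative of the returned vector. The saving grows with the gap between the batch size $m$ and the parameter count $n$.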

2. On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs #

Year: 2020
Authors: Matilde Gargiani, et al.
ArXiv ID: arXiv:2006.02409
Algorithm: SGN
URL: https://arxiv.org/abs/2006.02409

Abstract: Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to compute an approximate search direction, relies on the conjugate gradient method combined with forward and reverse automatic differentiation. Despite the success of SGD and its first-order variants, and despite Hessian-free methods based on the Gauss-Newton Hessian approximation having been already theoretically proposed as practical methods for training DNNs, we believe that SGN has a lot of undiscovered and yet not fully displayed potential in big mini-batch scenarios. For this setting, we demonstrate that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime. This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform, which, unlike Tensorflow and Pytorch, supports forward automatic differentiation. This enables researchers to further study and improve this promising optimization technique and hopefully reconsider stochastic second-order methods as competitive optimization techniques for training DNNs; we also hope that the promise of SGN may lead to forward automatic differentiation being added to Tensorflow or Pytorch. Our results also show that in big mini-batch scenarios SGN is more robust than SGD with respect to its hyperparameters (we never had to tune its step-size for our benchmarks!), which eases the expensive process of hyperparameter tuning that is instead crucial for the performance of first-order methods.

Source Code: Mentions ‘implementation’ in abstract

3. Stochastic Gauss-Newton Algorithms for Nonconvex Compositional Optimization #

Year: 2020
Authors: Quoc Tran-Dinh, et al.
ArXiv ID: arXiv:2002.07290
Algorithm: SGN with SARAH estimators
URL: https://arxiv.org/abs/2002.07290

Abstract: We develop two new stochastic Gauss-Newton algorithms for solving a class of non-convex stochastic compositional optimization problems frequently arising in practice. We consider both the expectation and finite-sum settings under standard assumptions, and use both classical stochastic and SARAH estimators for approximating function values and Jacobians. In the expectation case, we establish $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity to achieve a stationary point in expectation and estimate the total number of stochastic oracle calls for both function value and its Jacobian, where $\varepsilon$ is a desired accuracy. In the finite sum case, we also estimate $\mathcal{O}(\varepsilon^{-2})$ iteration-complexity and the total oracle calls with high probability. To our best knowledge, this is the first time such global stochastic oracle complexity is established for stochastic Gauss-Newton methods. Finally, we illustrate our theoretical results via two numerical examples on both synthetic and real datasets.

Source Code: No explicit source code information found
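To make the SARAH estimators mentioned in the abstract concrete, here is a hedged sketch of the recursive update applied to function values only. The component functions `F_i(x) = sin(A_i x)`, the batch size, and the random "optimizer step" are all hypothetical placeholders; the paper also applies the estimator to Jacobians, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 3, 2
A = rng.standard_normal((n, q, p))     # hypothetical data defining F_i(x) = sin(A_i x)

def F_batch(x, idx):
    """Mini-batch average of F_i(x) over the sample indices idx."""
    return np.sin(A[idx] @ x).mean(axis=0)

def sarah_update(prev_estimate, x_new, x_old, idx):
    """Recursive estimate: keep the old estimate and correct it with a batch difference."""
    return prev_estimate + F_batch(x_new, idx) - F_batch(x_old, idx)

x_old = rng.standard_normal(p)
estimate = F_batch(x_old, np.arange(n))            # snapshot: one full-batch evaluation
for _ in range(5):
    x_new = x_old - 0.1 * rng.standard_normal(p)   # placeholder for an optimizer step
    idx = rng.choice(n, size=10, replace=False)    # fresh mini-batch
    estimate = sarah_update(estimate, x_new, x_old, idx)
    x_old = x_new
```

When `idx` covers every sample, the update reproduces the exact average at `x_new`, which is a quick sanity check on the recursion.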

4. Nonlinear Least Squares for Large-Scale Machine Learning using Stochastic Jacobian Estimates #

Year: 2021
Authors: Johannes J. Brust
ArXiv ID: arXiv:2107.05598
Algorithm: NLLS1, NLLSL
URL: https://arxiv.org/abs/2107.05598

Abstract: For large nonlinear least squares loss functions in machine learning we exploit the property that the number of model parameters typically exceeds the data in one batch. This implies a low-rank structure in the Hessian of the loss, which enables effective means to compute search directions. Using this property, we develop two algorithms that estimate Jacobian matrices and perform well when compared to state-of-the-art methods.

Source Code: No explicit source code information found
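The low-rank structure the abstract refers to can be illustrated numerically. This sketch assumes a random stand-in for the batch Jacobian and checks only the Gauss-Newton term `J^T J`; the sizes are hypothetical and the snippet does not reproduce the paper's NLLS1 or NLLSL algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_params = 4, 50            # hypothetical sizes with n_params >> batch_size

# Stand-in for the Jacobian of the batch residuals with respect to the parameters.
J = rng.standard_normal((batch_size, n_params))

gn_matrix = J.T @ J                     # n_params x n_params Gauss-Newton term
rank = np.linalg.matrix_rank(gn_matrix, tol=1e-8)
```

Although `gn_matrix` is 50 by 50, its rank cannot exceed the number of rows of `J`, so a search direction can be computed from a much smaller system.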

5. Improving Levenberg-Marquardt Algorithm for Neural Networks #

Year: 2022
Authors: Omead Pooladzandi, Yiming Zhou
ArXiv ID: arXiv:2212.08769
Algorithm: LM
URL: https://arxiv.org/abs/2212.08769

Abstract: We explore the usage of the Levenberg-Marquardt (LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance.

Source Code: No explicit source code information found
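For readers unfamiliar with the baseline method, the following is a minimal sketch of textbook Levenberg-Marquardt on a two-parameter curve fit. It does not include the paper's additions (adaptive momentum, line search, uphill step acceptance), and the model `a * exp(b * x)`, the halving and doubling of `lam`, and all names are assumptions chosen for illustration.

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, theta, lam=1e-2, iters=100):
    """Damped Gauss-Newton: shrink lam after a successful step, grow it otherwise."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(iters):
        r = residual_fn(theta)
        J = jacobian_fn(theta)
        step = np.linalg.solve(J.T @ J + lam * np.eye(theta.size), -J.T @ r)
        candidate = theta + step
        if np.sum(residual_fn(candidate) ** 2) < np.sum(r ** 2):
            theta, lam = candidate, lam * 0.5   # accept: behave more like Gauss-Newton
        else:
            lam *= 2.0                          # reject: behave more like gradient descent
    return theta

# Hypothetical noise-free data generated from a = 2, b = -1.
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-1.0 * x)

def residual_fn(theta):
    a, b = theta
    return a * np.exp(b * x) - y

def jacobian_fn(theta):
    a, b = theta
    return np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

theta_hat = levenberg_marquardt(residual_fn, jacobian_fn, [1.0, 0.0])
```

The damping parameter `lam` interpolates between a Gauss-Newton step (small `lam`) and a short gradient step (large `lam`), which is what makes the method robust far from the solution.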

6. Rethinking Gauss-Newton for learning over-parameterized models #

Year: 2023
Authors: Michael Arbel, et al.
ArXiv ID: arXiv:2302.02904
URL: https://arxiv.org/abs/2302.02904

Abstract: This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.

Source Code: No explicit source code information found

7. Exact Gauss-Newton Optimization for Training Deep Neural Networks #

Year: 2024
Authors: Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon
ArXiv ID: arXiv:2405.14402
Algorithm: EGN
URL: https://arxiv.org/abs/2405.14402

Abstract: We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an $\epsilon$-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.

Source Code: No explicit source code information found
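The key computational point in the abstract is that the update requires factorizing only a matrix of mini-batch size. The sketch below checks a standard push-through form of that identity on random data; whether this matches the paper's exact formulation of the Duncan-Guttman identity is an assumption, and `J`, `r`, and `lam` are hypothetical stand-ins for the batch Jacobian, residual, and regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_params, lam = 8, 200, 1e-2
J = rng.standard_normal((batch_size, n_params))   # stand-in batch Jacobian
r = rng.standard_normal(batch_size)               # stand-in residual vector

# Direct form: solve an n_params x n_params system.
d_large = np.linalg.solve(J.T @ J + lam * np.eye(n_params), -J.T @ r)

# Equivalent form: solve only a batch_size x batch_size system,
# using (J^T J + lam I)^{-1} J^T = J^T (J J^T + lam I)^{-1}.
d_small = -J.T @ np.linalg.solve(J @ J.T + lam * np.eye(batch_size), r)
```

With 200 parameters and a batch of 8 the saving is modest, but the second form scales with the batch size rather than with the parameter count.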

Fisher Information #

1. Optimizing Neural Networks with Kronecker-factored Approximate Curvature #

Year: 2015
Authors: James Martens, Roger Grosse
ArXiv ID: arXiv:1503.05671
Algorithm: K-FAC
URL: https://arxiv.org/abs/1503.05671

Abstract: We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network’s Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC’s approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

Source Code: Various implementations are available; no specific repository is listed
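The abstract's claim that a Kronecker-factored block is cheap to invert can be illustrated for a single layer. This is a sketch under stated assumptions: `A` and `G` are random symmetric positive-definite stand-ins for the two Kronecker factors, `V` is a stand-in weight gradient, and column-major vectorization is used. It is not the K-FAC algorithm itself, which also specifies how the factors are estimated and damped.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3

def random_spd(n):
    """Random symmetric positive-definite matrix (illustrative only)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A = random_spd(n_in)                      # stand-in for the input-side factor
G = random_spd(n_out)                     # stand-in for the output-side factor
V = rng.standard_normal((n_out, n_in))    # stand-in gradient of one weight matrix

# Naive: build the (n_in * n_out)-dimensional block and solve against vec(V).
naive = np.linalg.solve(np.kron(A, G), V.flatten(order="F"))

# Kronecker structure: (A kron G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) for symmetric A.
fast = np.linalg.solve(G, V) @ np.linalg.inv(A)
```

The second computation never forms the large block; it works with one `n_out` by `n_out` and one `n_in` by `n_in` matrix.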

Other #

1. Second-order optimization with lazy Hessians #

Year: 2022
Authors: Nikita Doikov, El Mahdi Chayti, Martin Jaggi
ArXiv ID: arXiv:2212.00781
URL: https://arxiv.org/abs/2212.00781

Abstract: We analyze Newton’s method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every $d$ iterations, where $d$ is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.

Source Code: No explicit source code information found
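The lazy-Hessian idea from the abstract, a fresh gradient at every step but a Hessian refreshed only occasionally, can be sketched as follows. This is a simplified illustration without the cubic or quadratic regularization the paper analyzes; the strongly convex test function, the refresh period equal to the dimension `d`, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
Q = M @ M.T / d + 2.0 * np.eye(d)       # symmetric, eigenvalues at least 2
b = rng.standard_normal(d)

def grad(x):
    # Gradient of f(x) = 0.5 x^T Q x - b^T x + sum(log(cosh(x))).
    return Q @ x - b + np.tanh(x)

def hess(x):
    return Q + np.diag(1.0 - np.tanh(x) ** 2)

def lazy_newton(x, iters=60, refresh_every=d):
    n_hessians = 0
    for k in range(iters):
        if k % refresh_every == 0:      # recompute the Hessian only every refresh_every steps
            H = hess(x)
            n_hessians += 1
        x = x - np.linalg.solve(H, grad(x))   # fresh gradient, possibly stale Hessian
    return x, n_hessians

x_star, n_hessians = lazy_newton(np.zeros(d))
```

In practice one would factorize `H` once per refresh and reuse the factorization for the following steps, which is where the arithmetic saving comes from.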

Optimization Research Papers in JMLR Volume 24

Fri, 29 Sep 2023 00:00:00 +0000

Optimization Research Papers in JMLR Volume 24 (2023) #

This document lists papers from JMLR Volume 24 (2023) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization #

Papers addressing convex optimization problems, including sparse PCA, L0 regularization, and matrix decomposition.

Sparse PCA: A Geometric Approach
Authors: Dimitris Bertsimas, Driss Lahlou Kitane
Description: Develops a geometric approach for sparse principal component analysis using convex optimization techniques.
Fundamental Limits and Algorithms for Sparse Linear Regression with Sublinear Sparsity
Authors: Lan V. Truong
Description: Investigates algorithms and theoretical limits for sparse linear regression with sublinear sparsity in a convex framework.
Sparse Training with Lipschitz Continuous Loss Functions and a Weighted Group L0-norm Constraint
Authors: Michael R. Metel
Description: Proposes sparse training methods using Lipschitz continuous loss functions and group L0-norm constraints.
MARS: A Second-Order Reduction Algorithm for High-Dimensional Sparse Precision Matrices Estimation
Authors: Qian Li, Binyan Jiang, Defeng Sun
Description: Presents a second-order reduction algorithm for sparse precision matrix estimation using convex optimization.
Sparse GCA and Thresholded Gradient Descent
Authors: Sheng Gao, Zongming Ma
Description: Develops sparse generalized correlation analysis solved with thresholded gradient descent.
A Parameter-Free Conditional Gradient Method for Composite Minimization under Hölder Condition
Authors: Masaru Ito, Zhaosong Lu, Chuan He
Description: Introduces a parameter-free conditional gradient method for composite minimization under Hölder smoothness.
L0Learn: A Scalable Package for Sparse Learning using L0 Regularization
Authors: Hussein Hazimeh, Rahul Mazumder, Tim Nonet
Description: Presents a scalable package for sparse learning with L0 regularization.
Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach
Authors: Dimitris Bertsimas, Ryan Cory-Wright, Nicholas A. G. Johnson
Description: Proposes a discrete optimization approach for sparse plus low-rank matrix decomposition using convex methods.
Distributed Sparse Regression via Penalization
Authors: Yao Ji, Gesualdo Scutari, Ying Sun, Harsha Honnappa
Description: Develops distributed sparse regression algorithms using penalization techniques in convex optimization.
Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net
Authors: Oskar Allerbo, Johan Jonasson, Rebecka Jörnsten
Description: Introduces an iterative method approximating elastic net solution paths in convex settings.
A Novel Integer Linear Programming Approach for Global L0 Minimization
Authors: Diego Delle Donne, Matthieu Kowalski, Leo Liberti
Description: Proposes an integer linear programming approach for solving L0 minimization problems to global optimality.

Nonconvex Optimization #

Papers tackling nonconvex optimization, focusing on descent algorithms, majorization minimization, and minimax problems.

A Line-Search Descent Algorithm for Strict Saddle Functions with Complexity Guarantees
Authors: Michael J. O’Neill, Stephen J. Wright
Description: Develops a line-search descent algorithm for nonconvex strict saddle functions with complexity guarantees.
An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization
Authors: Le Thi Khanh Hien, Duy Nhat Phan, Nicolas Gillis
Description: Proposes an inertial block majorization minimization framework for nonsmooth nonconvex optimization.
Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the O(epsilon^(-7/4)) Complexity
Authors: Huan Li, Zhouchen Lin
Description: Introduces a restarted accelerated gradient descent method for nonconvex optimization, eliminating polylogarithmic factors.
Preconditioned Gradient Descent for Overparameterized Nonconvex Burer-Monteiro Factorization with Global Optimality Certification
Authors: Gavin Zhang, Salar Fattahi, Richard Y. Zhang
Description: Develops preconditioned gradient descent for nonconvex Burer-Monteiro factorization with global optimality guarantees.
Zeroth-Order Alternating Gradient Descent Ascent Algorithms for A Class of Nonconvex-Nonconcave Minimax Problems
Authors: Zi Xu, Zi-Qi Wang, Jun-Lin Wang, Yu-Hong Dai
Description: Proposes zeroth-order alternating gradient descent ascent for nonconvex-nonconcave minimax problems.

Stochastic Optimization #

Papers focusing on stochastic optimization methods, including gradient descent, proximal point methods, and continuous-time approaches.

On the Convergence of Stochastic Gradient Descent with Bandwidth-Based Step Size
Authors: Xiaoyu Wang, Ya-xiang Yuan
Description: Analyzes convergence of stochastic gradient descent with bandwidth-based step sizes.
Stochastic Optimization under Distributional Drift
Authors: Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui
Description: Studies stochastic optimization under distributional drift with theoretical guarantees.
Improved Powered Stochastic Optimization Algorithms for Large-Scale Machine Learning
Authors: Zhuang Yang
Description: Proposes improved powered stochastic optimization algorithms for large-scale machine learning.
Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
Authors: Xiao-Tong Yuan, Ping Li
Description: Provides a sharper analysis of minibatch stochastic proximal point methods, focusing on stability and smoothness.
A Continuous-Time Stochastic Gradient Descent Method for Continuous Data
Authors: Kexin Jin, Jonas Latz, Chenguang Liu, Carola-Bibiane Schönlieb
Description: Introduces a continuous-time stochastic gradient descent method for continuous data optimization.
Sensitivity-Free Gradient Descent Algorithms
Authors: Ion Matei, Maksym Zhenirovskyy, Johan de Kleer, John Maxwell
Description: Develops sensitivity-free gradient descent algorithms for stochastic optimization.

Distributed/Decentralized Optimization #

Papers addressing distributed or decentralized optimization algorithms, focusing on federated learning, asynchronous updates, and network topology.

Decentralized Learning: Theoretical Optimality and Practical Improvements
Authors: Yucheng Lu, Christopher De Sa
Description: Analyzes theoretical optimality and practical improvements for decentralized learning algorithms.
A General Theory for Federated Optimization with Asynchronous and Heterogeneous Clients Updates
Authors: Yann Fraboni, Richard Vidal, Laetitia Kameni, Marco Lorenzi
Description: Provides a general theory for federated optimization with asynchronous and heterogeneous client updates.
Buffered Asynchronous SGD for Byzantine Learning
Authors: Yi-Rui Yang, Wu-Jun Li
Description: Proposes buffered asynchronous SGD for Byzantine-resilient distributed learning.
Minimax Estimation for Personalized Federated Learning: An Alternative Between FedAvg and Local Training
Authors: Shuxiao Chen, Qinqing Zheng, Qi Long, Weijie J. Su
Description: Investigates minimax estimation for personalized federated learning, comparing FedAvg and local training.
Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD
Authors: Kun Yuan, Sulaiman A. Alghunaim, Xinmeng Huang
Description: Shows that removing the influence of data heterogeneity improves the network topology dependence of decentralized SGD.
Multi-Consensus Decentralized Accelerated Gradient Descent
Authors: Haishan Ye, Luo Luo, Ziang Zhou, Tong Zhang
Description: Develops multi-consensus decentralized accelerated gradient descent for distributed optimization.
Accelerated Primal-Dual Mirror Dynamics for Centralized and Distributed Constrained Convex Optimization Problems
Authors: You Zhao, Xiaofeng Liao, Xing He, Mingliang Zhou, Chaojie Li
Description: Proposes accelerated primal-dual mirror dynamics for centralized and distributed convex optimization.
Beyond Spectral Gap: The Role of the Topology in Decentralized Learning
Authors: Thijs Vogels, Hadrien Hendrikx, Martin Jaggi
Description: Examines the role of network topology in decentralized learning optimization.

Bandits and Online Learning #

Papers addressing multi-armed bandits, online optimization, and regret minimization.

Adaptation to the Range in K-Armed Bandits
Authors: Hédi Hadiji, Gilles Stoltz
Description: Studies adaptation to the range in K-armed bandit problems with regret minimization.
Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection
Authors: Wenhao Li, Ningyuan Chen, L. Jeff Hong
Description: Proposes dimension reduction techniques for contextual online learning with nonparametric variable selection.
Non-Stationary Online Learning with Memory and Non-Stochastic Control
Authors: Peng Zhao, Yu-Hu Yan, Yu-Xiang Wang, Zhi-Hua Zhou
Description: Investigates non-stationary online learning with memory and non-stochastic control strategies.
Online Non-Stochastic Control with Partial Feedback
Authors: Yu-Hu Yan, Peng Zhao, Zhi-Hua Zhou
Description: Develops online non-stochastic control methods with partial feedback for optimization.
A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits
Authors: Yasin Abbasi-Yadkori, András György, Nevena Lazić
Description: Analyzes dynamic regret in non-stationary stochastic bandit problems.
A PDE Approach for Regret Bounds under Partial Monitoring
Authors: Erhan Bayraktar, Ibrahim Ekren, Xin Zhang
Description: Uses a PDE-based approach to derive regret bounds for partial monitoring in online learning.
Continuous-in-Time Limit for Bayesian Bandits
Authors: Yuhua Zhu, Zachary Izzo, Lexing Ying
Description: Explores the continuous-time limit for Bayesian bandit algorithms with theoretical guarantees.
Bandit Problems with Fidelity Rewards
Authors: Gábor Lugosi, Ciara Pike-Burke, Pierre-André Savalle
Description: Studies bandit problems with fidelity rewards, focusing on regret minimization.
Linear Partial Monitoring for Sequential Decision Making: Algorithms, Regret Bounds and Applications
Authors: Johannes Kirschner, Tor Lattimore, Andreas Krause
Description: Develops algorithms and regret bounds for linear partial monitoring in sequential decision-making.

Optimization in Reinforcement Learning #

Papers focusing on optimization techniques for reinforcement learning, including actor-critic methods and constrained RL.

Reinforcement Learning for Joint Optimization of Multiple Rewards
Authors: Mridul Agarwal, Vaneet Aggarwal
Description: Focuses on reinforcement learning for optimizing multiple rewards simultaneously.
Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints
Authors: Qinbo Bai, Vaneet Aggarwal, Ather Gattami
Description: Proposes a sample-efficient model-free algorithm for MDPs with peak constraints.
Off-Policy Actor-Critic with Emphatic Weightings
Authors: Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White
Description: Develops off-policy actor-critic methods with emphatic weightings for RL optimization.
Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity
Authors: Ali Kara, Naci Saldi, Serdar Yüksel
Description: Analyzes convergence and near-optimality of Q-learning for MDPs with general spaces via quantization under weak continuity.
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
Authors: Kaiqing Zhang, Sham M. Kakade, Tamer Basar, Lin F. Yang
Description: Studies model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.
F2A2: Flexible Fully-Decentralized Approximate Actor-Critic for Cooperative Multi-Agent Reinforcement Learning
Authors: Wenhao Li, Bo Jin, Xiangfeng Wang, Junchi Yan, Hongyuan Zha
Description: Proposes a flexible fully-decentralized approximate actor-critic method for cooperative multi-agent RL.
Adaptation Augmented Model-Based Policy Optimization
Authors: Jian Shen, Hang Lai, Minghuan Liu, Han Zhao, Yong Yu, Weinan Zhang
Description: Introduces adaptation-augmented model-based policy optimization for RL.
Single Timescale Actor-Critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees
Authors: Mo Zhou, Jianfeng Lu
Description: Develops a single timescale actor-critic method for linear quadratic regulators with convergence guarantees.
Convex Reinforcement Learning in Finite Trials
Authors: Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli
Description: Investigates convex reinforcement learning with finite trials, focusing on optimization techniques.
Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning
Authors: Zihao Li, Boyi Liu, Zhuoran Yang, Zhaoran Wang, Mengdi Wang
Description: Proposes a variational primal-dual policy optimization method for constrained RL.
Instance-Dependent Confidence and Early Stopping for Reinforcement Learning
Authors: Eric Xia, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan
Description: Develops instance-dependent confidence bounds and early stopping strategies for RL optimization.

Optimization Research Papers in JMLR Volume 23

Thu, 29 Sep 2022 00:00:00 +0000

Optimization Research Papers in JMLR Volume 23 (2022) #

This document lists papers from JMLR Volume 23 (2022) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization #

Papers addressing convex optimization problems, including sparse PCA, L1-regularized SVMs, and metric-constrained problems.

Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality
Authors: Dimitris Bertsimas, Ryan Cory-Wright, Jean Pauphilet
Description: Develops convex optimization techniques for large-scale sparse principal component analysis with certifiable near-optimal solutions.
Novel Min-Max Reformulations of Linear Inverse Problems
Authors: Mohammed Rayyan Sheriff, Debasish Chatterjee
Description: Proposes min-max reformulations for linear inverse problems using convex optimization frameworks.
New Insights for the Multivariate Square-Root Lasso
Authors: Aaron J. Molstad
Description: Analyzes the square-root Lasso in multivariate settings, focusing on its convex optimization properties.
Towards An Efficient Approach for the Nonconvex lp Ball Projection: Algorithm and Analysis
Authors: Xiangyu Yang, Jiashan Wang, Hao Wang
Description: Develops and analyzes an efficient algorithm for projection onto the nonconvex lp ball.
Solving L1-Regularized SVMs and Related Linear Programs: Revisiting the Effectiveness of Column and Constraint Generation
Authors: Antoine Dedieu, Rahul Mazumder, Haoyue Wang
Description: Investigates L1-regularized SVMs using convex optimization with column and constraint generation.
Extensions to the Proximal Distance Method of Constrained Optimization
Authors: Alfonso Landeros, Oscar Hernan Madrid Padilla, Hua Zhou, Kenneth Lange
Description: Extends the proximal distance method for constrained convex optimization problems.
Stochastic Subgradient for Composite Convex Optimization with Functional Constraints
Authors: Ion Necoara, Nitesh Kumar Singh
Description: Analyzes stochastic subgradient methods for composite convex optimization with functional constraints.
On Regularized Square-Root Regression Problems: Distributionally Robust Interpretation and Fast Computations
Authors: Hong T.M. Chu, Kim-Chuan Toh, Yangjing Zhang
Description: Studies regularized square-root regression with a distributionally robust perspective and efficient computational methods.
Project and Forget: Solving Large-Scale Metric Constrained Problems
Authors: Rishi Sonthalia, Anna C. Gilbert
Description: Proposes a convex optimization approach for large-scale metric-constrained problems.
Faster Randomized Interior Point Methods for Tall/Wide Linear Programs
Authors: Agniva Chowdhury, Gregory Dexter, Palma London, Haim Avron, Petros Drineas
Description: Develops randomized interior point methods for efficient optimization of tall/wide linear programs.

Nonconvex Optimization #

Papers tackling nonconvex optimization, focusing on optimality, stability, and convergence in nonsmooth and game settings.

Optimality and Stability in Non-Convex Smooth Games
Authors: Guojun Zhang, Pascal Poupart, Yaoliang Yu
Description: Analyzes optimality and stability in nonconvex smooth games with convergence guarantees.
Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization
Authors: Zhize Li, Jian Li
Description: Proposes simple and optimal stochastic gradient methods for nonsmooth, nonconvex optimization.
Oracle Complexity in Nonsmooth Nonconvex Optimization
Authors: Guy Kornowski, Ohad Shamir
Description: Studies the oracle complexity of nonsmooth nonconvex optimization problems.
Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima
Authors: Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar
Description: Investigates distributed SGD for nonconvex, nonsmooth optimization with convergence to local minima.

Stochastic Optimization #

Papers focusing on stochastic optimization methods, including bundle methods, zeroth-order algorithms, and adaptive techniques.

A Stochastic Bundle Method for Interpolation
Authors: Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, M. Pawan Kumar
Description: Introduces a stochastic bundle method for efficient interpolation in optimization.
On Biased Stochastic Gradient Estimation
Authors: Derek Driggs, Jingwei Liang, Carola-Bibiane Schönlieb
Description: Analyzes biases in stochastic gradient estimation and their impact on optimization performance.
Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization
Authors: Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang
Description: Proposes accelerated zeroth-order and first-order momentum methods for a range of optimization problems.
Stochastic Zeroth-Order Optimization under Nonstationarity and Nonconvexity
Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra
Description: Studies zeroth-order optimization in nonstationary and nonconvex settings.
Accelerating Adaptive Cubic Regularization of Newton’s Method via Random Sampling
Authors: Xi Chen, Bo Jiang, Tianyi Lin, Shuzhong Zhang
Description: Enhances Newton’s method with adaptive cubic regularization using random sampling.
A Momentumized, Adaptive, Dual Averaged Gradient Method
Authors: Aaron Defazio, Samy Jelassi
Description: Develops a momentum-based adaptive gradient method for stochastic optimization.
Stochastic DCA with Variance Reduction and Applications in Machine Learning
Authors: Hoai An Le Thi, Hoang Phuc Hau Luu, Hoai Minh Le, Tao Pham Dinh
Description: Introduces a stochastic difference-of-convex-functions algorithm with variance reduction for machine learning.
Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks
Authors: Alireza Fallah, Mert Gürbüzbalaban, Asuman Ozdaglar, Umut Şimşekli, Lingjiong Zhu
Description: Proposes robust stochastic gradient methods for distributed optimization in multi-agent networks.
On Acceleration for Convex Composite Minimization with Noise-Corrupted Gradients and Approximate Proximal Mapping
Authors: Qiang Zhou, Sinno Jialin Pan
Description: Addresses acceleration in convex composite minimization with noisy gradients.
Asymptotic Study of Stochastic Adaptive Algorithms in Non-Convex Landscape
Authors: Sébastien Gadat, Ioana Gavra
Description: Analyzes the asymptotic behavior of stochastic adaptive algorithms in nonconvex settings.
Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration
Authors: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu
Description: Studies the Adam optimizer, focusing on nonconvexity, convergence, and mini-batch acceleration.
An Efficient Sampling Algorithm for Non-Smooth Composite Potentials
Authors: Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, Peter L. Bartlett
Description: Develops an efficient sampling algorithm for nonsmooth composite potentials in stochastic optimization.
SGD with Coordinate Sampling: Theory and Practice
Authors: Rémi Leluc, François Portier
Description: Explores coordinate sampling in stochastic gradient descent with theoretical and practical insights.

Distributed/Decentralized Optimization #

Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and convergence.

Asymptotic Network Independence and Step-Size for a Distributed Subgradient Method
Authors: Alex Olshevsky
Description: Analyzes step-size and convergence for a distributed subgradient optimization method.
Projection-Free Distributed Online Learning with Sublinear Communication Complexity
Authors: Yuanyu Wan, Guanghui Wang, Wei-Wei Tu, Lijun Zhang
Description: Develops projection-free algorithms for distributed online learning with reduced communication complexity.
Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization
Authors: Huan Li, Zhouchen Lin, Yongchun Fang
Description: Proposes variance-reduced methods for decentralized optimization with optimal acceleration.

Submodular Optimization #

Papers focusing on submodular optimization, particularly in model selection.

Joint Continuous and Discrete Model Selection via Submodularity
Authors: Jonathan Bunton, Paulo Tabuada
Description: Uses submodularity for joint continuous and discrete model selection in optimization.

Bandits and Online Learning #

Papers addressing multi-armed bandits, online optimization, and regret minimization.

Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism
Authors: Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
Description: Studies multi-agent online optimization with delays, focusing on asynchronicity and optimism.
Online Mirror Descent and Dual Averaging: Keeping Pace in the Dynamic Case
Authors: Huang Fang, Nicholas J. A. Harvey, Victor S. Portella, Michael P. Friedlander
Description: Analyzes online mirror descent and dual averaging for dynamic online optimization.
No Weighted-Regret Learning in Adversarial Bandits with Delays
Authors: Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet
Description: Investigates regret minimization in adversarial bandits with delays.
KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints
Authors: Aurélien Garivier, Hédi Hadiji, Pierre Ménard, Gilles Stoltz
Description: Provides optimal regret bounds for stochastic bandits using KL-UCB-Switch.
Multi-Agent Multi-Armed Bandits with Limited Communication
Authors: Mridul Agarwal, Vaneet Aggarwal, Kamyar Azizzadenesheli
Description: Explores multi-agent bandits with limited communication, focusing on regret minimization.
Nonstochastic Bandits with Composite Anonymous Feedback
Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Claudio Gentile, Yishay Mansour
Description: Studies nonstochastic bandits with composite feedback, analyzing regret and optimization.
Expected Regret and Pseudo-Regret are Equivalent When the Optimal Arm is Unique
Authors: Daron Anderson, Douglas J. Leith
Description: Proves that expected regret and pseudo-regret are equivalent when the optimal arm is unique.

Bayesian and Hyperparameter Optimization #

Papers addressing Bayesian optimization and hyperparameter tuning for efficient optimization.

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
Authors: Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, René Sass, Frank Hutter
Description: Presents SMAC3, a versatile Bayesian optimization package for hyperparameter tuning.
Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning
Authors: Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon
Description: Uses implicit differentiation for efficient hyperparameter selection in nonsmooth convex optimization.
Auto-Sklearn 2.0: Hands-Free AutoML via Meta-Learning
Authors: Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, Frank Hutter
Description: Introduces Auto-Sklearn 2.0, leveraging meta-learning for automated hyperparameter optimization.

Optimization in Reinforcement Learning #

Papers focusing on optimization techniques for reinforcement learning, including policy gradient and value estimation.

A Generalized Projected Bellman Error for Off-Policy Value Estimation in Reinforcement Learning
Authors: Andrew Patterson, Adam White, Martha White
Description: Develops optimization methods for off-policy value estimation using a generalized projected Bellman error.
Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences
Authors: Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White
Description: Investigates greedification operators for policy optimization, focusing on KL divergences.
Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms
Authors: Yanwei Jia, Xun Yu Zhou
Description: Analyzes policy gradient and actor-critic methods for continuous-time RL optimization.
On the Convergence Rates of Policy Gradient Methods
Authors: Lin Xiao
Description: Studies convergence rates of policy gradient methods in reinforcement learning.
Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor-Critic under State Distribution Mismatch
Authors: Shangtong Zhang, Remi Tachet des Combes, Romain Laroche
Description: Examines global optimality in softmax off-policy actor-critic methods under distribution mismatch.

Optimization Research Papers in JMLR Volume 22

Wed, 29 Sep 2021 00:00:00 +0000

Optimization Research Papers in JMLR Volume 22 (2021) #

This document lists papers from JMLR Volume 22 (2021) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization #

Papers addressing convex optimization problems, including clustering, Wasserstein barycenters, sparse optimization, and bandits.

Convex Clustering: Model, Theoretical Guarantee and Efficient Algorithm
Authors: Defeng Sun, Kim-Chuan Toh, Yancheng Yuan
Description: Proposes a convex clustering model with theoretical guarantees and an efficient algorithm.
A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters
Authors: Lei Yang, Jia Li, Defeng Sun, Kim-Chuan Toh
Description: Develops a fast, globally linearly convergent algorithm for computing Wasserstein barycenters.
Wasserstein Barycenters Can Be Computed in Polynomial Time in Fixed Dimension
Authors: Jason M. Altschuler, Enric Boix-Adsera
Description: Demonstrates that Wasserstein barycenters can be computed in polynomial time for fixed dimensions.
From Low Probability to High Confidence in Stochastic Convex Optimization
Authors: Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, Junyu Zhang
Description: Analyzes methods to achieve high-confidence solutions in stochastic convex optimization.
Sparse and Smooth Signal Estimation: Convexification of L0-Formulations
Authors: Alper Atamturk, Andres Gomez, Shaoning Han
Description: Proposes convexification techniques for L0-formulations in sparse and smooth signal estimation.
Stochastic Proximal AUC Maximization
Authors: Yunwen Lei, Yiming Ying
Description: Develops stochastic proximal methods for maximizing the area under the ROC curve (AUC) in convex settings.
Sparse Convex Optimization via Adaptively Regularized Hard Thresholding
Authors: Kyriakos Axiotis, Maxim Sviridenko
Description: Introduces adaptively regularized hard thresholding for sparse convex optimization.
Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives
Authors: Antoine Dedieu, Hussein Hazimeh, Rahul Mazumder
Description: Explores continuous and mixed-integer optimization approaches for learning sparse classifiers.
First-Order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems
Authors: Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang
Description: Provides first-order convergence theory for weakly convex-weakly concave min-max problems.
Convex Geometry and Duality of Over-parameterized Neural Networks
Authors: Tolga Ergen, Mert Pilanci
Description: Analyzes convex geometry and duality in over-parameterized neural networks.
Linear Bandits on Uniformly Convex Sets
Authors: Thomas Kerdreux, Christophe Roux, Alexandre d’Aspremont, Sebastian Pokutta
Description: Studies linear bandits on uniformly convex sets, focusing on convex optimization techniques.

Nonconvex Optimization #

Papers tackling nonconvex optimization, including stochastic gradient descent, neural network training, and stability properties.

Online Stochastic Gradient Descent on Non-Convex Losses from High-Dimensional Inference
Authors: Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath
Description: Analyzes online stochastic gradient descent for nonconvex losses in high-dimensional inference.
Non-attracting Regions of Local Minima in Deep and Wide Neural Networks
Authors: Henning Petzka, Cristian Sminchisescu
Description: Investigates non-attracting regions of local minima in deep and wide neural networks.
When Does Gradient Descent with Logistic Loss Find Interpolating Two-Layer Networks?
Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
Description: Examines conditions under which gradient descent with logistic loss finds interpolating two-layer networks.
Replica Exchange for Non-Convex Optimization
Authors: Jing Dong, Xin T. Tong
Description: Proposes replica exchange methods for nonconvex optimization problems.
Failures of Model-Dependent Generalization Bounds for Least-Norm Interpolation
Authors: Peter L. Bartlett, Philip M. Long
Description: Analyzes limitations of model-dependent generalization bounds in least-norm interpolation.
On the Stability Properties and the Optimization Landscape of Training Problems with Squared Loss for Neural Networks and General Nonlinear Conic Approximation Schemes
Authors: Constantin Christof
Description: Studies stability and optimization landscapes for neural network training with squared loss.

Stochastic Optimization #

Papers focusing on stochastic optimization methods, including momentum, Langevin dynamics, and communication-efficient algorithms.

Continuous Time Analysis of Momentum Methods
Authors: Nikola B. Kovachki, Andrew M. Stuart
Description: Provides a continuous-time analysis of momentum-based optimization methods.
Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions
Authors: Yunwen Lei, Ting Hu, Ke Tang
Description: Analyzes generalization performance of multi-pass stochastic gradient descent for convex losses.
High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm
Authors: Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan
Description: Develops an accelerated MCMC algorithm using high-order Langevin diffusion.
Path Length Bounds for Gradient Descent and Flow
Authors: Chirag Gupta, Sivaraman Balakrishnan, Aaditya Ramdas
Description: Establishes path length bounds for gradient descent and gradient flow.
Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives
Authors: Michael Muehlebach, Michael I. Jordan
Description: Analyzes momentum-based optimization from dynamical, control-theoretic, and symplectic perspectives.
L-SVRG and L-Katyusha with Arbitrary Sampling
Authors: Xun Qian, Zheng Qu, Peter Richtárik
Description: Introduces L-SVRG and L-Katyusha algorithms with arbitrary sampling for stochastic optimization.
A Lyapunov Analysis of Accelerated Methods in Optimization
Authors: Ashia C. Wilson, Ben Recht, Michael I. Jordan
Description: Provides a Lyapunov analysis for accelerated optimization methods.
NUQSGD: Provably Communication-Efficient Data-Parallel SGD via Nonuniform Quantization
Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy
Description: Proposes NUQSGD, a communication-efficient stochastic gradient descent method using nonuniform quantization.
An Inertial Newton Algorithm for Deep Learning
Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels
Description: Develops an inertial Newton algorithm for deep learning optimization.
Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent
Authors: Tian Tong, Cong Ma, Yuejie Chi
Description: Proposes scaled gradient descent for accelerating ill-conditioned low-rank matrix estimation.
On ADMM in Deep Learning: Convergence and Saturation-Avoidance
Authors: Jinshan Zeng, Shao-Bo Lin, Yuan Yao, Ding-Xuan Zhou
Description: Analyzes convergence and saturation-avoidance properties of ADMM in deep learning.
A Unified Convergence Analysis for Shuffling-Type Gradient Methods
Authors: Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, Marten van Dijk
Description: Provides a unified convergence analysis for shuffling-type gradient methods.
Stochastic Online Optimization Using Kalman Recursion
Authors: Joseph de Vilmarest, Olivier Wintenberger
Description: Applies Kalman recursion to stochastic online optimization.
Expanding Boundaries of Gap Safe Screening
Authors: Cassio F. Dantas, Emmanuel Soubies, Cédric Févotte
Description: Extends gap safe screening rules to a broader class of optimization problems.
Consensus-Based Optimization on the Sphere: Convergence to Global Minimizers and Machine Learning
Authors: Massimo Fornasier, Lorenzo Pareschi, Hui Huang, Philippe Sünnen
Description: Develops consensus-based optimization on the sphere with applications to machine learning.
Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo
Authors: Mert Gürbüzbalaban, Xuefeng Gao, Yuanhan Hu, Lingjiong Zhu
Description: Proposes decentralized stochastic gradient Langevin dynamics and Hamiltonian Monte Carlo methods.

Distributed/Decentralized Optimization #

Papers addressing distributed or decentralized optimization algorithms, focusing on communication efficiency and scalability.

Projection-Free Decentralized Online Learning for Submodular Maximization over Time-Varying Networks
Authors: Junlong Zhu, Qingtao Wu, Mingchuan Zhang, Ruijuan Zheng, Keqin Li
Description: Develops projection-free decentralized online learning for submodular maximization over time-varying networks.
Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
Authors: Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang
Description: Proposes a communication-efficient distributed covariance sketch for distributed PCA.
Optimal Rates of Distributed Regression with Imperfect Kernels
Authors: Hongwei Sun, Qiang Wu
Description: Establishes optimal rates for distributed regression with imperfect kernels.
One-Shot Federated Learning: Theoretical Limits and Algorithms to Achieve Them
Authors: Saber Salehkaleybar, Arsalan Sharifnassab, S. Jamaloddin Golestani
Description: Analyzes theoretical limits and algorithms for one-shot federated learning.
Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms
Authors: Jianyu Wang, Gauri Joshi
Description: Introduces a unified framework for designing and analyzing local-update SGD algorithms.
DeEPCA: Decentralized Exact PCA with Linear Convergence Rate
Authors: Haishan Ye, Tong Zhang
Description: Develops DeEPCA, a decentralized exact PCA method with linear convergence.

Submodular Optimization #

Papers focusing on submodular optimization, particularly in experimental design.

Batch Greedy Maximization of Non-Submodular Functions: Guarantees and Applications to Experimental Design
Authors: Jayanth Jagalur-Mohan, Youssef Marzouk
Description: Provides guarantees for batch greedy maximization of non-submodular functions with applications to experimental design.

Bandits and Online Learning #

Papers addressing multi-armed bandits, online optimization, and regret minimization.

Regulating Greed Over Time in Multi-Armed Bandits
Authors: Stefano Tracà, Cynthia Rudin, Weiyu Yan
Description: Studies methods to regulate greed over time in multi-armed bandits.
Preference-Based Online Learning with Dueling Bandits: A Survey
Authors: Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier
Description: Surveys preference-based online learning with dueling bandits.
On Multi-Armed Bandit Designs for Dose-Finding Trials
Authors: Maryam Aziz, Emilie Kaufmann, Marie-Karelle Riviere
Description: Explores multi-armed bandit designs for dose-finding trials.
Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
Authors: Julian Zimmert, Yevgeny Seldin
Description: Proposes Tsallis-INF, an optimal algorithm for stochastic and adversarial bandits.
Bandit Convex Optimization in Non-Stationary Environments
Authors: Peng Zhao, Guanghui Wang, Lijun Zhang, Zhi-Hua Zhou
Description: Addresses bandit convex optimization in non-stationary environments.
A Contextual Bandit Bake-off
Authors: Alberto Bietti, Alekh Agarwal, John Langford
Description: Compares contextual bandit algorithms in a comprehensive evaluation.
MetaGrad: Adaptation Using Multiple Learning Rates in Online Learning
Authors: Tim van Erven, Wouter M. Koolen, Dirk van der Hoeven
Description: Introduces MetaGrad, an adaptive online learning algorithm with multiple learning rates.
Achieving Fairness in the Stochastic Multi-Armed Bandit Problem
Authors: Vishakha Patil, Ganesh Ghalme, Vineet Nair, Y. Narahari
Description: Develops methods for achieving fairness in stochastic multi-armed bandits.
Refined Approachability Algorithms and Application to Regret Minimization with Global Costs
Authors: Joon Kwon
Description: Proposes refined approachability algorithms for regret minimization with global costs.
Bandit Learning in Decentralized Matching Markets
Authors: Lydia T. Liu, Feng Ruan, Horia Mania, Michael I. Jordan
Description: Applies bandit learning to decentralized matching markets.
Thompson Sampling Algorithms for Cascading Bandits
Authors: Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
Description: Develops Thompson sampling algorithms for cascading bandits.
Fast Learning for Renewal Optimization in Online Task Scheduling
Authors: Michael J. Neely
Description: Proposes fast learning methods for renewal optimization in online task scheduling.

Bayesian and Hyperparameter Optimization #

Papers addressing Bayesian optimization and hyperparameter tuning for scalable and robust optimization.

An Empirical Study of Bayesian Optimization: Acquisition Versus Partition
Authors: Erich Merrill, Alan Fern, Xiaoli Fern, Nima Dolatnia
Description: Conducts an empirical study comparing acquisition and partition strategies in Bayesian optimization.
Hyperparameter Optimization via Sequential Uniform Designs
Authors: Zebin Yang, Aijun Zhang
Description: Proposes sequential uniform designs for hyperparameter optimization.
Are We Forgetting about Compositional Optimisers in Bayesian Optimisation?
Authors: Antoine Grosnit, Alexander I. Cowen-Rivers, Rasul Tutunov, Ryan-Rhys Griffiths, Jun Wang, Haitham Bou-Ammar
Description: Explores the role of compositional optimizers in Bayesian optimization.
GIBBON: General-Purpose Information-Based Bayesian Optimisation
Authors: Henry B. Moss, David S. Leslie, Javier Gonzalez, Paul Rayson
Description: Introduces GIBBON, a general-purpose information-based Bayesian optimization framework.
On lp-Hyperparameter Learning via Bilevel Nonsmooth Optimization
Authors: Takayuki Okuno, Akiko Takeda, Akihiro Kawana, Motokazu Watanabe
Description: Studies lp-hyperparameter learning using bilevel nonsmooth optimization.

Optimization in Reinforcement Learning #

Papers focusing on optimization techniques for reinforcement learning, including policy iteration and Q-learning.

Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach
Authors: Alberto Maria Metelli, Matteo Pirotta, Daniele Calandriello, Marcello Restelli
Description: Proposes a safe policy iteration method with monotonic improvement for reinforcement learning.
On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
Authors: Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan
Description: Analyzes the optimality, approximation, and distribution shift in policy gradient methods.
Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms
Authors: Vikram Krishnamurthy, George Yin
Description: Applies Langevin dynamics to adaptive inverse reinforcement learning for stochastic gradient algorithms.
Hamilton-Jacobi Deep Q-Learning for Deterministic Continuous-Time Systems with Lipschitz Continuous Controls
Authors: Jeongho Kim, Jaeuk Shin, Insoon Yang
Description: Develops Hamilton-Jacobi deep Q-learning for deterministic continuous-time systems.
Partial Policy Iteration for L1-Robust Markov Decision Processes
Authors: Chin Pang Ho, Marek Petrik, Wolfram Wiesemann
Description: Introduces partial policy iteration for L1-robust Markov decision processes.
Gaussian Approximation for Bias Reduction in Q-Learning
Authors: Carlo D’Eramo, Andrea Cini, Alessandro Nuara, Matteo Pirotta, Cesare Alippi, Jan Peters, Marcello Restelli
Description: Proposes Gaussian approximation techniques for bias reduction in Q-learning.

Optimization Research Papers in JMLR Volume 21

Tue, 29 Sep 2020 00:00:00 +0000

Optimization Research Papers in JMLR Volume 21 (2020) #

This document lists papers from JMLR Volume 21 (2020) that focus on optimization research, categorized by their primary themes. Each paper is numbered starting from 1 within its subsection, with a brief description of its key contributions to optimization theory, algorithms, or applications.

Convex Optimization #

Papers addressing convex optimization problems, including complexity bounds, convergence analysis, and applications in regression and assortment optimization.

A Low Complexity Algorithm with O(√T) Regret and O(1) Constraint Violations for Online Convex Optimization with Long Term Constraints
Authors: Hao Yu, Michael J. Neely
Description: Proposes a low-complexity algorithm for online convex optimization with long-term constraints, achieving O(√T) regret and O(1) constraint violations.
Lower Bounds for Parallel and Randomized Convex Optimization
Authors: Jelena Diakonikolas, Cristóbal Guzmán
Description: Establishes lower complexity bounds for parallel and randomized algorithms in convex optimization.
Discerning the Linear Convergence of ADMM for Structured Convex Optimization through the Lens of Variational Analysis
Authors: Xiaoming Yuan, Shangzhi Zeng, Jin Zhang
Description: Analyzes the linear convergence of ADMM for structured convex optimization using variational analysis.
A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints
Authors: Qihang Lin, Selvaprabu Nadarajah, Negar Soheili, Tianbao Yang
Description: Develops a data-efficient level set method for stochastic convex optimization with expectation constraints.
Conic Optimization for Quadratic Regression Under Sparse Noise
Authors: Igor Molybog, Ramtin Madani, Javad Lavaei
Description: Applies conic optimization to quadratic regression under sparse noise conditions.
Dynamic Assortment Optimization with Changing Contextual Information
Authors: Xi Chen, Yining Wang, Yuan Zhou
Description: Addresses dynamic assortment optimization with changing contextual information using convex optimization techniques.
Convex Programming for Estimation in Nonlinear Recurrent Models
Authors: Sohail Bahmani, Justin Romberg
Description: Uses convex programming for parameter estimation in nonlinear recurrent models.

Nonconvex Optimization #

Papers tackling nonconvex optimization, focusing on guarantees for local minima, variance reduction, and algorithmic advancements.

Exact Guarantees on the Absence of Spurious Local Minima for Non-negative Rank-1 Robust Principal Component Analysis
Authors: Salar Fattahi, Somayeh Sojoudi
Description: Provides exact guarantees for the absence of spurious local minima in non-negative rank-1 robust PCA.
Stochastic Nested Variance Reduction for Nonconvex Optimization
Authors: Dongruo Zhou, Pan Xu, Quanquan Gu
Description: Introduces a stochastic nested variance reduction method for nonconvex optimization.
ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization
Authors: Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, Quoc Tran-Dinh
Description: Proposes ProxSARAH, an efficient framework for stochastic composite nonconvex optimization.
Convergence Rates for the Stochastic Gradient Descent Method for Non-Convex Objective Functions
Authors: Benjamin Fehrman, Benjamin Gess, Arnulf Jentzen
Description: Analyzes convergence rates of stochastic gradient descent for nonconvex objective functions.
AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes
Authors: Rachel Ward, Xiaoxia Wu, Leon Bottou
Description: Studies sharp convergence of AdaGrad stepsize schedules in nonconvex optimization.
A Sparse Semismooth Newton Based Proximal Majorization-Minimization Algorithm for Nonconvex Square-Root-Loss Regression Problems
Authors: Peipei Tang, Chengjing Wang, Defeng Sun, Kim-Chuan Toh
Description: Develops a sparse semismooth Newton-based proximal majorization-minimization algorithm for nonconvex square-root-loss regression.

Stochastic Optimization #

Papers focusing on stochastic optimization methods, including gradient descent, variance reduction, and robustness to noise.

Convergences of Regularized Algorithms and Stochastic Gradient Methods with Random Projections
Authors: Junhong Lin, Volkan Cevher
Description: Analyzes convergence of regularized algorithms and stochastic gradient methods with random projections.
Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
Authors: Dominic Richards, Patrick Rebeschini
Description: Studies graph-dependent implicit regularization in distributed stochastic subgradient descent.
Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions
Authors: Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
Description: Proposes a robust asynchronous stochastic gradient-push method with asymptotically optimal performance for strongly convex functions.
On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics
Authors: Xi Chen, Simon S. Du, Xin T. Tong
Description: Investigates stationary-point hitting time and ergodicity in stochastic gradient Langevin dynamics.
Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization
Authors: Aryan Mokhtari, Hamed Hassani, Amin Karbasi
Description: Extends stochastic conditional gradient methods from convex minimization to submodular maximization.
A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning
Authors: Aryan Mokhtari, Alec Koppel, Martin Takac, Alejandro Ribeiro
Description: Introduces parallel doubly stochastic algorithms for large-scale learning.
Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers
Authors: Yao Ma, Alex Olshevsky, Csaba Szepesvari, Venkatesh Saligrama
Description: Applies gradient descent to sparse rank-one matrix completion for crowd-sourced worker aggregation.
Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms
Authors: Junhong Lin, Volkan Cevher
Description: Establishes optimal convergence rates for distributed learning using stochastic gradient methods and spectral algorithms.
Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise
Authors: Andrei Kulunchakov, Julien Mairal
Description: Develops estimate sequences for stochastic composite optimization with variance reduction and noise robustness.
A Unified q-Memorization Framework for Asynchronous Stochastic Optimization
Authors: Bin Gu, Wenhan Xian, Zhouyuan Huo, Cheng Deng, Heng Huang
Description: Proposes a unified q-memorization framework for asynchronous stochastic optimization.
Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms
Authors: Yazhen Wang, Shang Wu
Description: Analyzes gradient descent algorithms using stochastic differential equations in statistical and computational settings.
The Error-Feedback Framework: SGD with Delayed Gradients
Authors: Sebastian U. Stich, Sai Praneeth Karimireddy
Description: Introduces an error-feedback framework for stochastic gradient descent with delayed gradients.

Distributed/Parallel Optimization #

Papers addressing distributed or parallel optimization algorithms, focusing on communication efficiency and scalability.

On the Complexity Analysis of the Primal Solutions for the Accelerated Randomized Dual Coordinate Ascent
Authors: Huan Li, Zhouchen Lin
Description: Analyzes the complexity of primal solutions for accelerated randomized dual coordinate ascent in distributed settings.
WONDER: Weighted One-shot Distributed Ridge Regression in High Dimensions
Authors: Edgar Dobriban, Yue Sheng
Description: Proposes WONDER, a weighted one-shot distributed ridge regression method for high-dimensional data.
GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning
Authors: Anis Elgabli, Jihong Park, Amrit S. Bedi, Mehdi Bennis, Vaneet Aggarwal
Description: Introduces GADMM, a fast and communication-efficient framework for distributed machine learning.
Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction
Authors: Boyue Li, Shicong Cen, Yuxin Chen, Yuejie Chi
Description: Develops communication-efficient distributed optimization with gradient tracking and variance reduction.
On Convergence of Distributed Approximate Newton Methods: Globalization, Sharper Bounds and Beyond
Authors: Xiao-Tong Yuan, Ping Li
Description: Analyzes convergence of distributed approximate Newton methods with sharper bounds and globalization techniques.

Submodular Optimization #

Papers focusing on submodular optimization, including minimization and maximization problems.

Quadratic Decomposable Submodular Function Minimization: Theory and Practice
Authors: Pan Li, Niao He, Olgica Milenkovic
Description: Studies quadratic decomposable submodular function minimization with theoretical and practical insights.
Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization
Authors: Rad Niazadeh, Tim Roughgarden, Joshua R. Wang
Description: Develops optimal algorithms for continuous non-monotone submodular and DR-submodular maximization.

Bayesian and Hyperparameter Optimization #

Papers addressing Bayesian optimization and hyperparameter tuning for scalable and robust optimization.

Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly
Authors: Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R. Collins, Jeff Schneider, Barnabas Poczos, Eric P. Xing
Description: Introduces Dragonfly, a scalable and robust Bayesian optimization framework for hyperparameter tuning.
Distributionally Ambiguous Optimization for Batch Bayesian Optimization
Authors: Nikitas Rontsis, Michael A. Osborne, Paul J. Goulart
Description: Proposes distributionally ambiguous optimization for batch Bayesian optimization.
The Kalai-Smorodinsky Solution for Many-Objective Bayesian Optimization
Authors: Mickael Binois, Victor Picheny, Patrick Taillandier, Abderrahmane Habbal
Description: Applies the Kalai-Smorodinsky solution to many-objective Bayesian optimization.
Robust Reinforcement Learning with Bayesian Optimisation and Quadrature
Authors: Supratik Paul, Konstantinos Chatzilygeroudis, Kamil Ciosek, Jean-Baptiste Mouret, Michael A. Osborne, Shimon Whiteson
Description: Integrates Bayesian optimization and quadrature for robust reinforcement learning.

Optimization in Reinforcement Learning #

Papers focusing on optimization techniques for policy optimization and reinforcement learning.

Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems
Authors: Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, Martin J. Wainwright
Description: Develops derivative-free methods for policy optimization in linear quadratic systems with guarantees.
Expected Policy Gradients for Reinforcement Learning
Authors: Kamil Ciosek, Shimon Whiteson
Description: Introduces expected policy gradients for reinforcement learning optimization.
Importance Sampling Techniques for Policy Optimization
Authors: Alberto Maria Metelli, Matteo Papini, Nico Montali, Marcello Restelli
Description: Proposes importance sampling techniques for efficient policy optimization in reinforcement learning.

Ebooks & related papers on Convex Optimization

Ebooks #

Papers #

Pre-print articles on Adagrad-variant methods

1. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models #

2. Accelerated Parameter-Free Stochastic Optimization #

3. Universal Gradient Methods for Stochastic Convex Optimization #

Pre-print articles on Adaptive Optimization

1. A simple uniformly optimal method without line search for convex optimization #

2. Adaptive Proximal Gradient Method for Convex Optimization #

3. An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes #

4. Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance #

Pre-print articles on gradient-clipping methods

1. Why gradient clipping accelerates training: A theoretical justification for adaptivity #

2. Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees #

3. Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed #

Mathematics - Optimization

Branches of Optimization Research #

Convex Optimization #

Discrete, Combinatorial, and Integer Optimization #

Operations Research #

Meta-heuristics #

Dynamic Programming and Reinforcement Learning #

Constraint Programming #

Combinatorial Optimization #

Stochastic Optimization and Control #

Useful Resources #

Post on Optimization #

Pre-print articles on Difference-of-Convex (DC) Programming

57. Stochastic Difference-of-Convex Optimization with Momentum #

56. On the convergence rate of the boosted Difference-of-Convex Algorithm (DCA) #

55. Global solution algorithms for DC programming via polyhedral approximations of convex functions #

54. Improved Rates for Stochastic Variance-Reduced Difference-of-Convex Algorithms #

53. New Algorithms for maximizing the difference of convex functions #

52. A progressive decoupling algorithm for minimizing the difference of convex and weakly convex functions #

51. An Inexact Proximal Framework for Nonsmooth Riemannian Difference-of-Convex Optimization [arXiv:2509.08561] #

50. Tight Convergence Rates in Gradient Mapping for the Difference-of-Convex Algorithm [arXiv:2506.01791] #

49. Enforcing Fairness Where It Matters: An Approach Based on Difference-of-Convex Constraints [arXiv:2505.12530] #

48. A smoothing moving balls approximation method for a class of conic-constrained difference-of-convex optimization problems [arXiv:2505.12314] #

47. A preconditioned difference of convex functions algorithm with extrapolation and line search [arXiv:2505.11914] #

46. Contractive difference-of-convex algorithms [arXiv:2505.10800] #

45. A full splitting algorithm for structured difference-of-convex programs [arXiv:2505.02588] #

44. Optimization over Trained Neural Networks: Difference-of-Convex Algorithm and Application to Data Center Scheduling [arXiv:2503.17506] #

43. Tight Analysis of Difference-of-Convex Algorithm (DCA) Improves Convergence Rates for Proximal Gradient Descent [arXiv:2503.04486] #

42. Abstract nonautonomous difference inclusions in locally convex spaces [arXiv:2502.05184] #

41. Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees [arXiv:2502.00240] #

40. An Inexact Boosted Difference of Convex Algorithm for Nondifferentiable Functions [arXiv:2412.05697] #

39. A preconditioned second-order convex splitting algorithm with a difference of varying convex functions and line search [arXiv:2411.07661] #

38. Inertial Proximal Difference-of-Convex Algorithm with Convergent Bregman Plug-and-Play for Nonconvex Imaging [arXiv:2409.03262] #

37. Constructing Tight Quadratic Relaxations for Global Optimization: II. Underestimating Difference-of-Convex (D.C.) Functions [arXiv:2408.13058] #

36. Distributed Difference of Convex Optimization [arXiv:2407.16728] #

35. An Inexact Bregman Proximal Difference-of-Convex Algorithm with Two Types of Relative Stopping Criteria [arXiv:2406.04646] #

34. Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions [arXiv:2405.18577] #

33. Improved convergence rates for the Difference-of-Convex algorithm [arXiv:2403.16864] #

32. An Efficient Difference-of-Convex Solver for Privacy Funnel [arXiv:2403.04778] #

31. Approximation analysis for the minimization problem of difference-of-convex functions with Moreau envelopes [arXiv:2402.13461] #

30. The Boosted Difference of Convex Functions Algorithm for Value-at-Risk Constrained Portfolio Optimization [arXiv:2402.09194] #

29. A Globally Convergent Algorithm for Neural Network Parameter Optimization Based on Difference-of-Convex Functions [arXiv:2401.07936] #

28. Higher-order tensor methods for minimizing difference of convex functions [arXiv:2401.05063] #

27. Handling nonlinearities and uncertainties of fed-batch cultivations with difference of convex functions tube MPC [arXiv:2312.00847] #

26. A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces [arXiv:2310.17610] #

25. Large Convex sets in Difference sets [arXiv:2309.07527] #

24. Moreau Envelope Based Difference-of-weakly-Convex Reformulation and Algorithm for Bilevel Programs [arXiv:2306.16761] #

23. Generalized Graph Signal Sampling by Difference-of-Convex Optimization [arXiv:2306.14634] #

22. A globally convergent difference-of-convex algorithmic framework and application to log-determinant optimization problems [arXiv:2306.02001] #

21. A property of strictly convex functions which differ from each other by a constant on the boundary of their domain [arXiv:2305.12183] #

20. Local Differences Determined by Convex sets [arXiv:2304.00888] #

19. Preconditioned Algorithm for Difference of Convex Functions with applications to Graph Ginzburg-Landau Model [arXiv:2303.14495] #

18. Multi-UAV trajectory planning problem using the difference of convex function programming [arXiv:2303.07581] #