mlx_optimizers.MADGRAD


class MADGRAD(learning_rate: float | Callable[[array], array], momentum: float = 0.9, weight_decay: float = 0.0, eps: float = 1e-6)

Momentumized, Adaptive, Dual averaged GRADient [1].

\[
\begin{split}
s_0 &= 0, \quad v_0 = 0, \quad x_0 = \theta \\
\lambda_t &= \eta \sqrt{t + 1} \\
s_{t+1} &= s_t + \lambda_t g_t \\
v_{t+1} &= v_t + \lambda_t g_t^2 \\
z_{t+1} &= x_0 - \frac{s_{t+1}}{\sqrt[3]{v_{t+1}} + \epsilon} \\
\theta_{t+1} &= (1 - \beta)\, \theta_t + \beta\, z_{t+1}
\end{split}
\]

[1] Defazio, Aaron, and Samy Jelassi. A momentumized, adaptive, dual averaged gradient method. JMLR 23.144 (2022): 1-34. https://arxiv.org/abs/2101.11075 (code: facebookresearch/madgrad)
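
The update can be read off the equations directly. Below is a minimal, illustrative sketch of a single step written with mlx.core arrays; the variable names mirror the symbols above, weight decay is omitted, and this is not the library's internal implementation.

    import mlx.core as mx

    def madgrad_step(g, theta, x0, s, v, t, lr=1e-2, momentum=0.9, eps=1e-6):
        # lambda_t = eta * sqrt(t + 1)
        lam = lr * (t + 1) ** 0.5
        # dual-averaged gradient sum and squared-gradient sum
        s = s + lam * g
        v = v + lam * mx.square(g)
        # proximal point with a cube-root denominator
        z = x0 - s / (mx.power(v, 1.0 / 3.0) + eps)
        # interpolate toward z with the momentum coefficient beta
        theta = (1 - momentum) * theta + momentum * z
        return theta, s, v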

Parameters:
  • learning_rate (float or callable) – The learning rate \(\eta\).

  • momentum (float, optional) – The momentum coefficient \(\beta\). Default: 0.9

  • weight_decay (float, optional) – The weight decay coefficient. Default: 0.0

  • eps (float, optional) – The term \(\epsilon\) added to the denominator to improve numerical stability. Default: 1e-6

Figure: MADGRAD optimization trajectory on the Rosenbrock function (rosenbrock_MADGRAD.png).
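
A minimal usage sketch, assuming MADGRAD follows the standard mlx.optimizers update interface; the linear model, MSE loss, and random batch below are placeholders:

    import mlx.core as mx
    import mlx.nn as nn
    import mlx_optimizers as optim

    model = nn.Linear(10, 1)
    optimizer = optim.MADGRAD(learning_rate=1e-3, momentum=0.9)

    def loss_fn(model, X, y):
        return nn.losses.mse_loss(model(X), y)

    # one training step on a random batch
    X, y = mx.random.normal((32, 10)), mx.random.normal((32, 1))
    loss, grads = nn.value_and_grad(model, loss_fn)(model, X, y)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)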

Methods

  • __init__(learning_rate[, momentum, ...])

  • apply_single(gradient, parameter, state) – To be extended by derived classes to implement the optimizer's update.

  • init_single(parameter, state) – To be extended by derived classes to implement each optimizer's state initialization.