mlx_optimizers.DiffGrad

class DiffGrad(learning_rate: float | Callable[[array], array], betas: List[float] = [0.9, 0.999], weight_decay: float = 0.0, eps: float = 1e-08)

Difference of Gradients [1].

\[\begin{split}m_0 &= 0, v_0 = 0, gp_0 = 0 \\ m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ c_t &= (1 + \exp({-|gp_{t-1} - g_t|}))^{-1} \\ \alpha_t &= \eta \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \\ \theta_{t} &= \theta_{t-1} - \alpha_t \frac{m_t c_t}{\sqrt{v_t} + \epsilon}\end{split}\]

[1] Dubey, Shiv Ram, et al., 2019. diffGrad: An Optimization Method for Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems. https://arxiv.org/abs/1909.11015, code: https://github.com/shivram1987/diffGrad
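
As a concrete illustration of the update rule above, here is a minimal sketch of a single DiffGrad step on one parameter array. It mirrors the equations (with a 1-based step count t), not the library's internal implementation; the function name and argument layout are illustrative.

    import math
    import mlx.core as mx

    def diffgrad_step(theta, g, m, v, g_prev, t, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
        # first and second moment estimates, as in Adam
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * mx.square(g)
        # difference-of-gradients friction coefficient c_t in (0, 1)
        c = 1.0 / (1.0 + mx.exp(-mx.abs(g_prev - g)))
        # bias-corrected step size alpha_t (t >= 1)
        alpha = lr * math.sqrt(1 - beta2**t) / (1 - beta1**t)
        # friction-damped Adam-style parameter update
        theta = theta - alpha * (m * c) / (mx.sqrt(v) + eps)
        return theta, m, v, g  # the current gradient becomes gp_t for the next step

When the gradient barely changes between steps, c_t is close to 0.5 and the step is damped; a large change in the gradient pushes c_t toward 1, recovering the full Adam-like step.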

Parameters:
  • learning_rate (float or callable) – learning rate \(\eta\).

  • betas (Tuple[float, float], optional) – coefficients \((\beta_1, \beta_2)\) used for computing running averages of the gradient and its square. Default: (0.9, 0.999)

  • weight_decay (float, optional) – weight decay. Default: 0.0

  • eps (float, optional) – term \(\epsilon\) added to the denominator to improve numerical stability. Default: 1e-8
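
A short usage sketch of these parameters in a standard mlx.nn training loop; the model, data, and loss below are placeholders chosen only for illustration.

    import mlx.core as mx
    import mlx.nn as nn
    import mlx_optimizers as optim

    model = nn.Linear(4, 1)
    optimizer = optim.DiffGrad(learning_rate=1e-3, betas=[0.9, 0.999])

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    x = mx.random.normal((32, 4))
    y = mx.random.normal((32, 1))

    loss_and_grad = nn.value_and_grad(model, loss_fn)
    for _ in range(10):
        loss, grads = loss_and_grad(model, x, y)
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)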

[Figure: DiffGrad optimization trajectory on the Rosenbrock function]

Methods

  • __init__(learning_rate[, betas, ...])

  • apply_single(gradient, parameter, state) – To be extended by derived classes to implement the optimizer's update.

  • init_single(parameter, state) – Initialize the optimizer state.
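
Both hooks come from the mlx.optimizers.Optimizer base class: init_single sets up the per-parameter state and apply_single returns the updated parameter. A rough toy subclass (plain SGD, not part of mlx_optimizers) is sketched below purely to show where the two hooks fit; DiffGrad's own implementation additionally keeps its moment estimates and previous gradient in the per-parameter state.

    import mlx.core as mx
    from mlx.optimizers import Optimizer

    class PlainSGD(Optimizer):
        # toy optimizer illustrating the init_single/apply_single hooks
        def __init__(self, learning_rate: float):
            super().__init__()
            self._lr = learning_rate  # stored as a plain float for simplicity

        def init_single(self, parameter, state):
            # vanilla SGD keeps no per-parameter buffers; an optimizer like
            # DiffGrad would create its state entries (e.g. moments) here
            pass

        def apply_single(self, gradient, parameter, state):
            # called once per parameter array; returns the updated parameter
            return parameter - self._lr * gradient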