mlx_optimizers.DiffGrad#
- class DiffGrad(learning_rate: float | Callable[[array], array], betas: List[float] = [0.9, 0.999], weight_decay: float = 0.0, eps: float = 1e-08)#
Difference of Gradients [1].
\[\begin{split}m_0 &= 0, v_0 = 0, gp_0 = 0 \\ m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ c_t &= (1 + \exp({-|gp_{t-1} - g_t|}))^{-1} \\ \alpha_t &= \eta \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \\ \theta_{t} &= \theta_{t-1} - \alpha_t \frac{m_t c_t}{\sqrt{v_t} + \epsilon}\end{split}\]

[1] Dubey, Shiv Ram, et al., 2019. diffGrad: An Optimization Method for Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems. https://arxiv.org/abs/1909.11015 shivram1987/diffGrad
- Parameters:
  - learning_rate (float or callable) – learning rate \(\eta\). 
  - betas (Tuple[float, float], optional) – coefficients \((\beta_1, \beta_2)\) used for computing running averages of the gradient and its square. Default: (0.9, 0.999)
  - weight_decay (float, optional) – weight decay. Default: 0.0
  - eps (float, optional) – term \(\epsilon\) added to the denominator to improve numerical stability. Default: 1e-8
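The update rule above can be sketched in plain NumPy for a single parameter array; this is an illustrative re-implementation, not the library's code. It assumes the standard Adam-style bias correction and that the previous-gradient state `gp` is replaced by the current gradient after each step.

```python
import numpy as np

def diffgrad_step(theta, g, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One DiffGrad update for a single parameter array (NumPy sketch).

    state = (m, v, gp, t): first moment, second moment,
    previous gradient, and step count.
    """
    m, v, gp, t = state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * g        # first-moment EMA
    v = betas[1] * v + (1 - betas[1]) * g ** 2   # second-moment EMA
    # Friction coefficient: sigmoid of |previous gradient - current gradient|;
    # near-zero gradient change damps the step.
    c = 1.0 / (1.0 + np.exp(-np.abs(gp - g)))
    # Adam-style bias-corrected step size (assumption, see lead-in).
    alpha = lr * np.sqrt(1 - betas[1] ** t) / (1 - betas[0] ** t)
    theta = theta - alpha * m * c / (np.sqrt(v) + eps)
    return theta, (m, v, g, t)

# Single step from zero-initialized state:
theta = np.array([1.0])
state = (np.zeros(1), np.zeros(1), np.zeros(1), 0)
theta, state = diffgrad_step(theta, np.array([0.5]), state)
```

With a positive gradient the parameter moves down, and the friction coefficient stays in (0.5, 1) on the first step because `gp` starts at zero.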
Methods

- __init__(learning_rate[, betas, ...])
- apply_single(gradient, parameter, state) – To be extended by derived classes to implement the optimizer's update.
- init_single(parameter, state) – Initialize optimizer state.