See also SGD and ODEs

Nesterov momentum is based on "On the importance of initialization and momentum in deep learning" (Sutskever et al., 2013)

"\\begin{algorithm}\n\\caption{SGD}\n\\begin{algorithmic}\n\\State \\textbf{input:} $\\gamma$ (lr), $\\theta_0$ (params), $f(\\theta)$ (objective), $\\lambda$ (weight decay),\n\\State $\\mu$ (momentum), $\\tau$ (dampening), nesterov, maximize\n\\For{$t = 1$ to $...$}\n \\State $g_t \\gets \\nabla_\\theta f_t(\\theta_{t-1})$\n \\If{$\\lambda \\neq 0$}\n \\State $g_t \\gets g_t + \\lambda\\theta_{t-1}$\n \\EndIf\n \\If{$\\mu \\neq 0$}\n \\If{$t > 1$}\n \\State $b_t \\gets \\mu b_{t-1} + (1-\\tau)g_t$\n \\Else\n \\State $b_t \\gets g_t$\n \\EndIf\n \\If{$\\text{nesterov}$}\n \\State $g_t \\gets g_t + \\mu b_t$\n \\Else\n \\State $g_t \\gets b_t$\n \\EndIf\n \\EndIf\n \\If{$\\text{maximize}$}\n \\State $\\theta_t \\gets \\theta_{t-1} + \\gamma g_t$\n \\Else\n \\State $\\theta_t \\gets \\theta_{t-1} - \\gamma g_t$\n \\EndIf\n\\EndFor\n\\State \\textbf{return} $\\theta_t$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 5 SGD

1:input: γ\gamma (lr), θ0\theta_0 (params), f(θ)f(\theta) (objective), λ\lambda (weight decay),

2:μ\mu (momentum), τ\tau (dampening), nesterov, maximize

3:for t=1t = 1 to ...... do

4:gtθft(θt1)g_t \gets \nabla_\theta f_t(\theta_{t-1})

5:if λ0\lambda \neq 0 then

6:gtgt+λθt1g_t \gets g_t + \lambda\theta_{t-1}

7:end if

8:if μ0\mu \neq 0 then

9:if t>1t > 1 then

10:btμbt1+(1τ)gtb_t \gets \mu b_{t-1} + (1-\tau)g_t

11:else

12:btgtb_t \gets g_t

13:end if

14:if nesterov\text{nesterov} then

15:gtgt+μbtg_t \gets g_t + \mu b_t

16:else

17:gtbtg_t \gets b_t

18:end if

19:end if

20:if maximize\text{maximize} then

21:θtθt1+γgt\theta_t \gets \theta_{t-1} + \gamma g_t

22:else

23:θtθt1γgt\theta_t \gets \theta_{t-1} - \gamma g_t

24:end if

25:end for

26:return θt\theta_t
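Below is a minimal NumPy sketch of this update rule, assuming a single flat parameter vector; the name `sgd_step` and the explicit `state` dict holding the momentum buffer are illustrative, not a library API.

```python
import numpy as np

def sgd_step(theta, grad, state, lr, weight_decay=0.0, momentum=0.0,
             dampening=0.0, nesterov=False, maximize=False):
    """One SGD update following the pseudocode above.

    theta, grad : parameter vector and gradient of f at theta (np.ndarray)
    state       : dict holding the momentum buffer b between calls
    """
    g = grad.copy()
    if weight_decay != 0.0:
        g += weight_decay * theta                    # g_t <- g_t + lambda * theta_{t-1}
    if momentum != 0.0:
        b = state.get("b")
        if b is None:                                # t = 1: initialize buffer with g_t
            b = g.copy()
        else:                                        # t > 1: b_t <- mu*b_{t-1} + (1-tau)*g_t
            b = momentum * b + (1.0 - dampening) * g
        state["b"] = b
        g = g + momentum * b if nesterov else b      # Nesterov vs. heavy-ball direction
    return theta + lr * g if maximize else theta - lr * g

# toy usage: minimize f(theta) = 0.5*||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0, 3.0])
state = {}
for _ in range(100):
    theta = sgd_step(theta, grad=theta, state=state, lr=0.1, momentum=0.9, nesterov=True)
```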

Nesterov momentum

See also paper

idea:

  • first take a step in the direction of the accumulated momentum,
  • compute the gradient at the "lookahead" position,
  • make the update using this gradient.

definition

For a parameter vector $\theta$, the update can be expressed as

$$
\begin{aligned}
v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\
\theta_{t+1} &= \theta_t - \alpha v_t
\end{aligned}
$$
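A minimal sketch of this lookahead form, assuming `grad_L` is a callable returning $\nabla L$ at a point (the name and signature are illustrative):

```python
def nesterov_step(theta, v, grad_L, mu=0.9, alpha=0.01):
    """One Nesterov update in the lookahead form above."""
    lookahead = theta + mu * v           # position after the momentum step
    v_new = mu * v + grad_L(lookahead)   # v_t = mu*v_{t-1} + grad L(theta_t + mu*v_{t-1})
    theta_new = theta - alpha * v_new    # theta_{t+1} = theta_t - alpha*v_t
    return theta_new, v_new
```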

Nesterov's accelerated gradient achieves better worst-case convergence rates than plain gradient descent on convex problems:

| function type            | gradient descent                   | Nesterov AG                                |
| ------------------------ | ---------------------------------- | ------------------------------------------ |
| Smooth                   | $\Theta(\frac{1}{T})$              | $\Theta(\frac{1}{T^{2}})$                  |
| Smooth & Strongly Convex | $\Theta(\exp(-\frac{T}{\kappa}))$  | $\Theta(\exp(-\frac{T}{\sqrt{\kappa}}))$   |