See also paper

idea:

  • first take a step in the direction of accumulated momentum
  • computes gradient at “lookahead” position,
  • make the update using this gradient.

definition

For a parameter vector θ\theta, the update can be expressed as

vt=μvt1+L(θt+μvt1)θt+1=θtαvt\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}

Achieves better convergence rates

function typegradient descentNesterove AG
Smoothθ(1T)\theta(\frac{1}{T})θ(1T2)\theta(\frac{1}{T^{2}})
Smooth & Strongly Convexθ(exp(Tκ))\theta(\exp (-\frac{T}{\kappa}))θ(expTκ)\theta(\exp -\frac{T}{\sqrt{\kappa}})