See also SGD and ODEs
Nesterov momentum is based on *On the importance of initialization and momentum in deep learning* (Sutskever et al., 2013)
Nesterov momentum
See also the original paper
idea:
- first take a step in the direction of the accumulated momentum,
- compute the gradient at the "lookahead" position,
- make the update using this gradient.
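The three steps above can be sketched as a single update function. This is an illustrative helper (the name `nesterov_step` and its signature are my own, not from the paper); it assumes a gradient callback and the common momentum/learning-rate hyperparameters.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update (hypothetical helper, not the paper's code)."""
    lookahead = theta + mu * v   # 1. step in the direction of accumulated momentum
    g = grad_fn(lookahead)       # 2. gradient at the "lookahead" position
    v_new = mu * v - lr * g      # 3. fold the lookahead gradient into the velocity
    theta_new = theta + v_new    #    and apply the update
    return theta_new, v_new
```

For example, iterating this on f(θ) = ‖θ‖² (gradient 2θ) drives θ to the minimum at the origin.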
definition
For a parameter vector $\theta$, with momentum coefficient $\mu$ and learning rate $\varepsilon$, the update can be expressed as

$$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t)$$
$$\theta_{t+1} = \theta_t + v_{t+1}$$
Achieves better convergence rates than plain gradient descent on smooth convex problems
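The rate gap can be seen numerically. A minimal sketch, assuming an ill-conditioned quadratic of my own choosing (condition number 100) and the helper names `gd` / `nag`, which are not from the source:

```python
import numpy as np

# Illustrative ill-conditioned quadratic f(x) = 0.5 * (x1^2 + 100 * x2^2);
# its gradient is elementwise scales * x.
scales = np.array([1.0, 100.0])
grad = lambda x: scales * x

def gd(steps=100, lr=0.01):
    """Plain gradient descent with step size 1/L."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def nag(steps=100, lr=0.01, mu=0.9):
    """Nesterov momentum with the same step size."""
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = mu * v - lr * grad(x + mu * v)  # gradient at the lookahead point
        x = x + v
    return x
```

After the same number of steps, the Nesterov iterate ends up much closer to the minimizer than the plain gradient-descent iterate.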
| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $O(1/T)$ | $O(1/T^2)$ |
| Smooth & Strongly Convex | $O\big((1 - 1/\kappa)^T\big)$ | $O\big((1 - 1/\sqrt{\kappa})^T\big)$ |