---
date: '2024-11-11'
description: accelerated gradient descent using lookahead gradient computation, achieving better convergence rates than standard momentum with optimal parameter assignments.
id: Nesterov momentum
modified: 2026-06-05 15:08:06 GMT-04:00
tags:
  - ml
  - optimization
title: Nesterov momentum
transclude:
  title: false
created: '2024-11-11'
published: '2024-11-11'
pageLayout: default
slug: thoughts/Nesterov-momentum
permalink: https://aarnphm.xyz/thoughts/Nesterov-momentum.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
See also [paper](http://www.cs.toronto.edu/%7Ehinton/absps/momentum.pdf), [[thoughts/optimization#momentum]]

idea:

- first take a step in the direction of accumulated momentum
- computes gradient at “lookahead” position,
- make the update using this gradient.

> \[!abstract\] definition
>
> For a parameter vector $\theta$, the update can be expressed as
>
> $$
> \begin{aligned}
> v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\
> \theta_{t+1} &= \theta_t - \alpha v_t
> \end{aligned}
> $$

Achieves better convergence rates

| function type            | gradient descent                   | Nesterove AG                            |
| ------------------------ | ---------------------------------- | --------------------------------------- |
| Smooth                   | $\theta(\frac{1}{T})$              | $\theta(\frac{1}{T^{2}})$               |
| Smooth & Strongly Convex | $\theta(\exp (-\frac{T}{\kappa}))$ | $\theta(\exp -\frac{T}{\sqrt{\kappa}})$ |

> \[!math\] 1. optimal assignments for parameters
>
> $$
> \alpha = \frac{1}{\lambda_{\text{max}}}, \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}
> $$

