
Let us define the standard gradient descent approach for minimizing a differentiable convex function

The gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$ is the vector of partial derivatives:

$$\nabla f(w) = \left(\frac{\partial f(w)}{\partial w[1]}, \ldots, \frac{\partial f(w)}{\partial w[d]}\right)$$
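For example, for $f(w) = \lVert w \rVert^2 = \sum_{j=1}^{d} w[j]^2$, each partial derivative is $\frac{\partial f(w)}{\partial w[j]} = 2w[j]$, so $\nabla f(w) = 2w$.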

intuition

$$w^{t+1} = w^t - \alpha \nabla f(w^t)$$
"\\usepackage{pgfplots}\n\\pgfplotsset{compat=1.16}\n\n\\begin{document}\n\\begin{tikzpicture}\n \\begin{scope}\n \\clip(-4,-1) rectangle (4,4);\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(20/(sin(2*\\x)+2))},{sin(\\x)*sqrt(20/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(16/(sin(2*\\x)+2))},{sin(\\x)*sqrt(16/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(12/(sin(2*\\x)+2))},{sin(\\x)*sqrt(12/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(8/(sin(2*\\x)+2))},{sin(\\x)*sqrt(8/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(4/(sin(2*\\x)+2))},{sin(\\x)*sqrt(4/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(1/(sin(2*\\x)+2))},{sin(\\x)*sqrt(1/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(0.0625/(sin(2*\\x)+2))},{sin(\\x)*sqrt(0.0625/(sin(2*\\x)+2))});\n\n \\draw[->,blue,ultra thick] (-2,3.65) to (-1.93,3);\n \\draw[->,blue,ultra thick] (-1.93,3) to (-1.75,2.4);\n \\draw[->,blue,ultra thick] (-1.75,2.4) to (-1.5,1.8);\n \\draw[->,blue,ultra thick] (-1.5,1.8) to (-1.15,1.3);\n\n \\node at (-1.4,3.8){\\scriptsize $w[0]$};\n \\node at (-1.2,3.2){\\scriptsize $w[1]$};\n \\node at (-1.05,2.6){\\scriptsize $w[2]$};\n \\node at (-0.8,2){\\scriptsize $w[3]$};\n \\node at (-0.6,1.4){\\scriptsize $w[4]$};\n \\end{scope}\n\\end{tikzpicture}\n\\end{document}"w[0]w[1]w[2]w[3]w[4]
source code

idea

  • initialize $w^0$
  • iteratively, for $t = 0, 1, 2, \ldots$:
    • update $w^{t+1} = w^t - \alpha \nabla f(w^t)$

intuition: it should converge to a local minimum, depending on the learning rate $\alpha$

not necessarily a global minimum

but for convex functions, convergence to the global minimum is guaranteed
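A minimal sketch of this loop in Python (NumPy) on a convex quadratic, where convergence to the global minimum is expected; the particular matrix, starting point, step size $\alpha$, and iteration count below are illustrative choices, not fixed by the notes:

```python
import numpy as np

# Convex quadratic f(w) = 0.5 * w^T A w with A positive definite,
# so gradient descent should converge to the global minimum w* = 0.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def f(w):
    return 0.5 * w @ A @ w

def grad_f(w):
    return A @ w

def gradient_descent(w0, alpha=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_f(w)   # w^{t+1} = w^t - alpha * grad f(w^t)
    return w

w_final = gradient_descent(np.array([-2.0, 3.65]))
print(w_final, f(w_final))  # close to [0, 0] and 0 for this alpha
```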

calculate the gradient

$$\begin{aligned} E(w) &= L(w) + \lambda\, \text{Reg}(w) \\ L(w) &= \sum_{i} l(f_w(x^i), y^i) \\ \nabla_w L(w) &= \sum_{i} \nabla_w\, l(f_w(x^i), y^i) \end{aligned}$$
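Since the gradient is linear, the gradient of the full objective is $\nabla_w E(w) = \nabla_w L(w) + \lambda \nabla_w \text{Reg}(w)$; the sum over all training examples is the expensive part, which motivates the mini-batch trick below.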

trick: split the sum into mini-batches $S_j$ and compute one gradient per batch

$$\begin{aligned} \nabla_w^j &= \sum_{(x,y) \in S_j} \nabla_w\, l(f_w(x), y) \\ \nabla_w L(w) &= \sum_{j} \nabla_w^j \end{aligned}$$
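A sketch of this decomposition, assuming a hypothetical per-example gradient function `grad_loss(w, x, y)` $= \nabla_w\, l(f_w(x), y)$ and a partition of the data into mini-batches $S_j$:

```python
import numpy as np

def batch_gradient(w, batch, grad_loss):
    # nabla_w^j: gradient of the loss summed over one mini-batch S_j
    return sum(grad_loss(w, x, y) for x, y in batch)

def full_gradient(w, batches, grad_loss):
    # summing the per-batch gradients recovers nabla_w L(w) = sum_j nabla_w^j
    return sum(batch_gradient(w, batch, grad_loss) for batch in batches)

# Example: squared loss l(f_w(x), y) = (w.x - y)^2 / 2 for a linear model f_w(x) = w.x
def grad_loss(w, x, y):
    return (w @ x - y) * x

rng = np.random.default_rng(0)
data = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(12)]
batches = [data[i:i + 4] for i in range(0, len(data), 4)]   # three mini-batches
w = np.zeros(3)
assert np.allclose(full_gradient(w, batches, grad_loss),
                   batch_gradient(w, data, grad_loss))      # same gradient either way
```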

Stochastic gradient descent

See also: the numerical assignment on ODEs and GD, and the SGD implementation in PyTorch.

"\\begin{algorithm}\n\\caption{Stochastic Gradient Descent (SGD) update}\n\\begin{algorithmic}\n\\Require Learning rate schedule $\\{\\epsilon_1, \\epsilon_2, \\dots\\}$\n\\Require Initial parameter $\\theta$\n\\State $k \\gets 1$\n\\While{stopping criterion not met}\n \\State Sample a minibatch of $m$ examples from the training set $\\{x^{(1)}, \\dots, x^{(m)}\\}$ with corresponding targets $y^{(i)}$.\n \\State Compute gradient estimate: $\\hat{g} \\gets \\frac{1}{m} \\nabla_{\\theta} \\sum_{i} L\\bigl(f(x^{(i)};\\theta), y^{(i)}\\bigr)$\n \\State Apply update: $\\theta \\gets \\theta - \\epsilon_k \\hat{g}$\n \\State $k \\gets k + 1$\n\\EndWhile\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 1: Stochastic Gradient Descent (SGD) update

Require: learning rate schedule $\{\epsilon_1, \epsilon_2, \dots\}$
Require: initial parameter $\theta$

1: $k \gets 1$
2: while stopping criterion not met do
3:   Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training set, with corresponding targets $y^{(i)}$
4:   Compute gradient estimate: $\hat{g} \gets \frac{1}{m} \nabla_{\theta} \sum_{i} L\bigl(f(x^{(i)};\theta), y^{(i)}\bigr)$
5:   Apply update: $\theta \gets \theta - \epsilon_k \hat{g}$
6:   $k \gets k + 1$
7: end while
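For comparison with the PyTorch SGD implementation mentioned above, a minimal sketch of the same loop using `torch.optim.SGD`; the linear model, random data, loss, batch size, and fixed learning rate (a constant $\epsilon_k$) are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model: 100 points, linear regression.
X = torch.randn(100, 3)
y = torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):                  # stopping criterion: fixed epoch budget
    for xb, yb in loader:                # sample a minibatch of m examples
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)    # (1/m) * sum_i L(f(x_i; theta), y_i)
        loss.backward()                  # gradient estimate g_hat
        optimizer.step()                 # theta <- theta - epsilon * g_hat
```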


analysis of GD for Convex-Lipschitz Functions