
Let us define the standard gradient descent approach for minimizing a differentiable convex function

The gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$ is the vector of partial derivatives:

$$\nabla f(w) = \left(\frac{\partial f(w)}{\partial w[1]}, \ldots, \frac{\partial f(w)}{\partial w[d]}\right)$$
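For example, for $f(w) = \lVert w \rVert^2 = \sum_{j=1}^{d} w[j]^2$, each partial derivative is $\frac{\partial f(w)}{\partial w[j]} = 2w[j]$, so $\nabla f(w) = 2w$.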

intuition

$$w^{t+1} = w^t - \alpha \nabla f(w^t)$$
"\\usepackage{pgfplots}\n\\pgfplotsset{compat=1.16}\n\n\\begin{document}\n\\begin{tikzpicture}\n \\begin{scope}\n \\clip(-4,-1) rectangle (4,4);\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(20/(sin(2*\\x)+2))},{sin(\\x)*sqrt(20/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(16/(sin(2*\\x)+2))},{sin(\\x)*sqrt(16/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(12/(sin(2*\\x)+2))},{sin(\\x)*sqrt(12/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(8/(sin(2*\\x)+2))},{sin(\\x)*sqrt(8/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(4/(sin(2*\\x)+2))},{sin(\\x)*sqrt(4/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(1/(sin(2*\\x)+2))},{sin(\\x)*sqrt(1/(sin(2*\\x)+2))});\n \\draw plot[domain=0:360] ({cos(\\x)*sqrt(0.0625/(sin(2*\\x)+2))},{sin(\\x)*sqrt(0.0625/(sin(2*\\x)+2))});\n\n \\draw[->,blue,ultra thick] (-2,3.65) to (-1.93,3);\n \\draw[->,blue,ultra thick] (-1.93,3) to (-1.75,2.4);\n \\draw[->,blue,ultra thick] (-1.75,2.4) to (-1.5,1.8);\n \\draw[->,blue,ultra thick] (-1.5,1.8) to (-1.15,1.3);\n\n \\node at (-1.4,3.8){\\scriptsize $w[0]$};\n \\node at (-1.2,3.2){\\scriptsize $w[1]$};\n \\node at (-1.05,2.6){\\scriptsize $w[2]$};\n \\node at (-0.8,2){\\scriptsize $w[3]$};\n \\node at (-0.6,1.4){\\scriptsize $w[4]$};\n \\end{scope}\n\\end{tikzpicture}\n\\end{document}"w[0]w[1]w[2]w[3]w[4]
source code

idea

  • initialize $w^0$
  • iteratively, for $t = 0, 1, 2, \ldots$:
    • update $w^{t+1} = w^t - \alpha \nabla f(w^t)$

intuition: it should converge to a local minimum, depending on the learning rate $\alpha$

not necessarily a global minimum

but for convex functions, convergence to the global minimum is guaranteed
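A minimal sketch of this loop in Python (NumPy) on a convex quadratic, where convergence to the global minimum is expected; the particular matrix, starting point, step size $\alpha$, and iteration count below are illustrative choices, not fixed by the notes:

```python
import numpy as np

# Convex quadratic f(w) = 0.5 * w^T A w with A positive definite,
# so gradient descent should converge to the global minimum w* = 0.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def f(w):
    return 0.5 * w @ A @ w

def grad_f(w):
    return A @ w

def gradient_descent(w0, alpha=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_f(w)   # w^{t+1} = w^t - alpha * grad f(w^t)
    return w

w_final = gradient_descent(np.array([-2.0, 3.65]))
print(w_final, f(w_final))  # close to [0, 0] and 0 for this alpha
```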

calculate the gradient

$$\begin{aligned} E(w) &= L(w) + \lambda\, \text{Reg}(w) \\ L(w) &= \sum_{i} l(f_w(x^i), y^i) \\ \nabla_w L(w) &= \sum_{i} \nabla_w\, l(f_w(x^i), y^i) \end{aligned}$$
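Since the gradient is linear, the gradient of the full objective is $\nabla_w E(w) = \nabla_w L(w) + \lambda \nabla_w \text{Reg}(w)$; the sum over all training examples is the expensive part, which motivates the mini-batch trick below.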

trick: split the sum into mini-batches $S_j$ and compute one gradient per batch

$$\begin{aligned} \nabla_w^j &= \sum_{(x,y) \in S_j} \nabla_w\, l(f_w(x), y) \\ \nabla_w L(w) &= \sum_{j} \nabla_w^j \end{aligned}$$
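A sketch of this decomposition, assuming a hypothetical per-example gradient function `grad_loss(w, x, y)` $= \nabla_w\, l(f_w(x), y)$ and a partition of the data into mini-batches $S_j$:

```python
import numpy as np

def batch_gradient(w, batch, grad_loss):
    # nabla_w^j: gradient of the loss summed over one mini-batch S_j
    return sum(grad_loss(w, x, y) for x, y in batch)

def full_gradient(w, batches, grad_loss):
    # summing the per-batch gradients recovers nabla_w L(w) = sum_j nabla_w^j
    return sum(batch_gradient(w, batch, grad_loss) for batch in batches)

# Example: squared loss l(f_w(x), y) = (w.x - y)^2 / 2 for a linear model f_w(x) = w.x
def grad_loss(w, x, y):
    return (w @ x - y) * x

rng = np.random.default_rng(0)
data = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(12)]
batches = [data[i:i + 4] for i in range(0, len(data), 4)]   # three mini-batches
w = np.zeros(3)
assert np.allclose(full_gradient(w, batches, grad_loss),
                   batch_gradient(w, data, grad_loss))      # same gradient either way
```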

Stochastic gradient descent

See also: the numerical assignment on ODEs and GD, and the SGD implementation in PyTorch.

"\\begin{algorithm}\n\\caption{Stochastic Gradient Descent (SGD) update}\n\\begin{algorithmic}\n\\Require Learning rate schedule $\\{\\epsilon_1, \\epsilon_2, \\dots\\}$\n\\Require Initial parameter $\\theta$\n\\State $k \\gets 1$\n\\While{stopping criterion not met}\n \\State Sample a minibatch of $m$ examples from the training set $\\{x^{(1)}, \\dots, x^{(m)}\\}$ with corresponding targets $y^{(i)}$.\n \\State Compute gradient estimate: $\\hat{g} \\gets \\frac{1}{m} \\nabla_{\\theta} \\sum_{i} L\\bigl(f(x^{(i)};\\theta), y^{(i)}\\bigr)$\n \\State Apply update: $\\theta \\gets \\theta - \\epsilon_k \\hat{g}$\n \\State $k \\gets k + 1$\n\\EndWhile\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 1: Stochastic Gradient Descent (SGD) update

Require: learning rate schedule $\{\epsilon_1, \epsilon_2, \dots\}$
Require: initial parameter $\theta$

1: $k \gets 1$
2: while stopping criterion not met do
3:   Sample a minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training set, with corresponding targets $y^{(i)}$
4:   Compute gradient estimate: $\hat{g} \gets \frac{1}{m} \nabla_{\theta} \sum_{i} L\bigl(f(x^{(i)};\theta), y^{(i)}\bigr)$
5:   Apply update: $\theta \gets \theta - \epsilon_k \hat{g}$
6:   $k \gets k + 1$
7: end while
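For comparison with the PyTorch SGD implementation mentioned above, a minimal sketch of the same loop using `torch.optim.SGD`; the linear model, random data, loss, batch size, and fixed learning rate (a constant $\epsilon_k$) are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model: 100 points, linear regression.
X = torch.randn(100, 3)
y = torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):                  # stopping criterion: fixed epoch budget
    for xb, yb in loader:                # sample a minibatch of m examples
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)    # (1/m) * sum_i L(f(x_i; theta), y_i)
        loss.backward()                  # gradient estimate g_hat
        optimizer.step()                 # theta <- theta - epsilon * g_hat
```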


analysis of GD for Convex-Lipschitz Functions