gradient descent
Let us define the standard gradient descent approach for minimizing a differentiable convex function.
The gradient of a differentiable function $f:\mathbb{R}^d \to \mathbb{R}$ at $w$ is the vector of partial derivatives:
$$\nabla f(w) = \left(\frac{\partial f(w)}{\partial w[1]}, \ldots, \frac{\partial f(w)}{\partial w[d]}\right)$$
$$w_{t+1} = w_t - \alpha \nabla f(w_t)$$
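As a quick sanity check on the definitions above, here is a small NumPy sketch; the quadratic $f$ is an assumed toy example, not from the notes. It compares the analytic partial derivatives with a finite-difference approximation and performs one update $w \leftarrow w - \alpha \nabla f(w)$.

```python
import numpy as np

# Toy convex example (assumed for illustration): f(w) = w[0]^2 + 2*w[1]^2, f: R^2 -> R.
def f(w):
    return w[0]**2 + 2.0 * w[1]**2

def grad_f(w):
    # analytic gradient: vector of partial derivatives (df/dw[0], df/dw[1])
    return np.array([2.0 * w[0], 4.0 * w[1]])

def numerical_grad(f, w, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = np.array([1.0, -1.0])
print(grad_f(w), numerical_grad(f, w))  # the two should approximately agree

# one gradient descent step: w <- w - alpha * grad f(w)
alpha = 0.1
w_next = w - alpha * grad_f(w)
print(w_next)
```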
idea
- initialize $w_0$
- iteratively, for each $t = 0, 1, 2, \ldots$:
    - $w_{t+1} = w_t - \alpha \nabla f(w_t)$
intuition: it should converge to a local minimum, depending on the learning rate $\alpha$ (see the sketch below)
- not necessarily the global minimum
- but for convex functions, convergence to the global minimum is guaranteed
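A minimal sketch of the full iteration on an assumed one-dimensional convex toy objective $f(w) = (w-3)^2$ (chosen purely for illustration), showing the dependence on the learning rate $\alpha$: a small step size converges to the minimizer, while a step size that is too large diverges.

```python
import numpy as np

# Toy convex objective (assumed for illustration): f(w) = (w - 3)^2, minimized at w = 3.
f = lambda w: (w - 3.0) ** 2
grad = lambda w: 2.0 * (w - 3.0)

def gradient_descent(w0, alpha, steps=50):
    w = w0
    for t in range(steps):
        w = w - alpha * grad(w)  # w_{t+1} = w_t - alpha * grad f(w_t)
    return w

print(gradient_descent(w0=0.0, alpha=0.1))  # converges close to the minimizer w = 3
print(gradient_descent(w0=0.0, alpha=1.1))  # step size too large: the iterates diverge
```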
calculate the gradient
$$\begin{aligned}
E(w) &= L(w) + \lambda\,\mathrm{Reg}(w) \\
L(w) &= \sum_i \ell(f_w(x_i), y_i) \\
\nabla_w L(w) &= \sum_i \nabla_w \ell(f_w(x_i), y_i)
\end{aligned}$$
trick: split the sum over examples into mini-batches $S_j$ and accumulate the gradient batch by batch
$$\nabla w_j = \sum_{(x,y)\in S_j} \nabla_w \ell(f_w(x), y), \qquad \nabla_w L(w) = \sum_j \nabla w_j$$
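A minimal NumPy sketch of this splitting trick, assuming a linear model with squared loss as a stand-in for $f_w$ and $\ell$ (these specifics are illustrative, not from the notes): summing the per-batch gradients over the $S_j$ reproduces the full gradient, since the sum simply distributes over the batches.

```python
import numpy as np

# Illustrative model (assumption): linear model f_w(x) = w . x with squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.normal(size=12)
w = rng.normal(size=3)

def grad_loss(w, X, y):
    # gradient of sum_i (w.x_i - y_i)^2  ->  sum_i 2*(w.x_i - y_i)*x_i
    residual = X @ w - y
    return 2.0 * X.T @ residual

full_grad = grad_loss(w, X, y)

# split the index set into mini-batches S_j and compute per-batch gradients
batches = np.array_split(np.arange(len(y)), 3)
batch_grads = [grad_loss(w, X[idx], y[idx]) for idx in batches]

# the sum of mini-batch gradients equals the full gradient
print(np.allclose(sum(batch_grads), full_grad))  # True
```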
See also the numerical assignment on ODEs and GD, and the SGD implementation in PyTorch.
Algorithm 1: Stochastic Gradient Descent (SGD) update
Require: Learning rate schedule $\{\epsilon_1, \epsilon_2, \ldots\}$
Require: Initial parameter $\theta$
$k \leftarrow 1$
while stopping criterion not met do
    Sample a minibatch of $m$ examples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$ with corresponding targets $y^{(i)}$
    Compute gradient estimate: $\hat{g} \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
    Apply update: $\theta \leftarrow \theta - \epsilon_k \hat{g}$
    $k \leftarrow k + 1$
end while
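As a companion to the pseudocode above, here is a minimal PyTorch sketch of the same loop using `torch.optim.SGD`; the linear model, data, and hyperparameters are placeholder assumptions, not fixed by the notes.

```python
import torch

# Placeholder data and model (assumptions for illustration): linear regression in d = 3.
X = torch.randn(100, 3)
y = torch.randn(100, 1)
model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

m = 10  # minibatch size
for epoch in range(5):
    perm = torch.randperm(len(X))
    for start in range(0, len(X), m):
        idx = perm[start:start + m]            # sample a minibatch of m examples
        optimizer.zero_grad()                  # clear the previous gradient estimate
        loss = loss_fn(model(X[idx]), y[idx])  # average loss over the minibatch
        loss.backward()                        # compute the gradient estimate g_hat
        optimizer.step()                       # apply update: theta <- theta - lr * g_hat
```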
analysis of GD for Convex-Lipschitz Functions