See also: slides 3, slides 4, slides 5

adding bias in d-dimensional OLS

X^{\prime}_{n \times (d+1)} = \begin{pmatrix} x_1^{1} & \cdots & x_1^{d} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_n^{1} & \cdots & x_n^{d} & 1 \end{pmatrix}

and

W_{(d+1) \times 1} = \begin{pmatrix} w_1 \\ \vdots \\ w_d \\ w_0 \end{pmatrix}

Add a new auxiliary dimension to the input data, x_{d+1} = 1.

Solve OLS:

\min\limits_{W \in \mathbb{R}^{d \times 1}} \|XW - Y\|_2^2

Gradient for f: \mathbb{R}^d \rightarrow \mathbb{R}

\nabla_{w} \, f = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix}

Jacobian for g: \mathbb{R}^m \rightarrow \mathbb{R}^n

\begin{aligned} \nabla_{w} \, g &= \begin{bmatrix} \frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_1}{\partial w_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_n}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_m} \end{bmatrix}_{n \times m} \\ \\ &u, v \in \mathbb{R}^d: \quad g(u) = u^T v \implies \nabla_{u} \, g = v \text{ (gradient)} \\ \\ &A \in \mathbb{R}^{n \times n},\ u \in \mathbb{R}^n: \quad g(u) = u^T A u \implies \nabla_{u} \, g = (A + A^T)\, u \text{ (gradient)} \end{aligned}
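
Applying these identities to the OLS objective fills in the step to the result below (expanding the squared norm, then setting the gradient to zero):

\begin{aligned} \|XW - Y\|_2^2 &= W^T X^T X W - 2 Y^T X W + Y^T Y \\ \nabla_{W} \left( \|XW - Y\|_2^2 \right) &= 2 X^T X W - 2 X^T Y = 0 \\ \implies X^T X W &= X^T Y \end{aligned}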

result

W^{\text{LS}} = (X^T X)^{-1} X^T Y
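
A minimal NumPy sketch of the bias trick and the closed-form solve (the toy data, noise level, and variable names are my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 3.0
Y = X @ true_w + true_b + 0.1 * rng.normal(size=n)

# Append the auxiliary dimension x_{d+1} = 1 so that w_0 absorbs the bias.
X_aug = np.hstack([X, np.ones((n, 1))])

# W_LS = (X^T X)^{-1} X^T Y; np.linalg.solve avoids forming an explicit inverse.
W_ls = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
print(W_ls)  # last entry recovers the bias (~3.0)
```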

non-linear data

The idea is to augment the input with additional features (similar to the bias padding above), e.g. polynomial terms.

multivariate polynomials.

Question: what happens in the case of multivariate polynomials?

  • Assume M \gg d
  • Number of terms (monomials): \approx \left(\frac{M}{d}\right)^d (worked example below)
  • # of training samples needed \approx # of parameters
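
For a concrete sense of scale (illustrative numbers, not from the slides), take d = 10 input dimensions and degree M = 100:

\left(\frac{M}{d}\right)^d = \left(\frac{100}{10}\right)^{10} = 10^{10} \text{ monomials}

so on the order of 10^{10} training samples would be needed.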

an example of the curse of dimensionality

overfitting.

strategies to avoid overfitting:

  • add more training data
  • L1 (Lasso) or L2 (Ridge) regularization
    • add a penalty term to the objective function
    • L1 (Lasso) produces sparse models: penalizing the absolute value of the weights forces some coefficients to become exactly 0, which acts as a form of feature selection (see the sketch after this list). \text{Loss}(w) = \text{Error} + \lambda \| w \|_1
    • L2 (Ridge) does not perform feature selection: weights are only shrunk toward 0 rather than set exactly to 0 as with L1, so all features are kept, which helps when many features are weakly informative. \text{Loss}(w) = \text{Error} + \lambda \| w \|_2^2
  • Cross-validation
    • split the data into k folds; train on k−1 folds and validate on the held-out fold, rotating through the folds
  • early stopping
  • dropout, see example
    • randomly selected neurons are ignored during training, which makes the network less sensitive to the weights of any individual neuron
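
A minimal scikit-learn sketch (toy data of my own) contrasting the sparsity of Lasso with the shrinkage of Ridge:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]          # only 3 of the 20 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)      # L2 penalty

print((lasso.coef_ == 0).sum())         # many coefficients are exactly 0 (sparse)
print((ridge.coef_ == 0).sum())         # typically 0: weights shrink but stay nonzero
```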

sample complexity of learning multivariate polynomials

regularization.

L2 regularization:

\min\limits_{W \in \mathbb{R}^{d}} \| XW - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}

Solving for W^{\text{RLS}}

Setting the gradient to zero and solving gives:

W^{\text{RLS}} = (X^T X + \lambda I)^{-1} X^T Y

The inverse exists as long as \lambda > 0, since X^T X is positive semi-definite and adding \lambda I makes it positive definite.
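
A quick NumPy check of this claim (illustrative dimensions of my own): with fewer samples than features, X^T X alone is singular, but X^T X + \lambda I is still invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 10, 0.1                  # fewer samples than features
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

A = X.T @ X
print(np.linalg.matrix_rank(A))          # at most n < d, so A alone is singular
W_rls = np.linalg.solve(A + lam * np.eye(d), X.T @ Y)   # still solvable for lambda > 0
print(W_rls.shape)
```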

polynomial curve-fitting revisited

feature map: \phi(x): \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2} where d_2 \gg d_1

training:

  • W^{*} = \arg\min\limits_{W} \| \phi W - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}
  • W^{*} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y

prediction:

  • \hat{y} = \langle W^{*}, \phi(x) \rangle = {W^{*}}^T \phi(x) (see the sketch below)
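
A sketch of training and prediction with a polynomial feature map and the ridge solution above (the 1-D toy data, degree, and \lambda are my own choices):

```python
import numpy as np

def poly_features(x, degree):
    """Polynomial feature map phi(x) = [1, x, x^2, ..., x^degree] for scalar inputs."""
    return np.stack([x ** k for k in range(degree + 1)], axis=1)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)

lam, degree = 1e-3, 9
Phi = poly_features(x, degree)                       # n x (degree+1) design matrix
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

x_new = np.array([0.3])
y_hat = poly_features(x_new, degree) @ W             # prediction: W*^T phi(x)
print(y_hat)
```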

choices of \phi(x)

  • Gaussian basis functions: \phi(x) = \exp\left(-\frac{\| x - \mu \|^{2}}{2\sigma^{2}}\right) (see the sketch after this list)
  • Polynomial basis functions: \phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}
  • Fourier basis functions: DFT, FFT
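
A variant of the previous sketch using Gaussian basis functions (the centers \mu_j and bandwidth \sigma are my own illustrative choices):

```python
import numpy as np

def gaussian_features(x, centers, sigma):
    """phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)) for each center mu_j."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)

centers, sigma, lam = np.linspace(-1, 1, 10), 0.2, 1e-3
Phi = gaussian_features(x, centers, sigma)           # n x 10 design matrix
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centers)), Phi.T @ y)
print(gaussian_features(np.array([0.3]), centers, sigma) @ W)
```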

computational complexity

calculate W^{\text{RLS}} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y

matmul:

  • Naive: O(d^3)
  • Strassen's algorithm: O(d^{2.81})
  • Coppersmith–Winograd: O(d^{2.376})

matrix inversion:

  • Gaussian elimination: O(d^3)
  • Cholesky decomposition: O(d^3) (about \frac{1}{3}d^3 FLOPs; see the sketch below)
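
In practice the explicit inverse is avoided; a sketch (random data for illustration) solving (\phi^T \phi + \lambda I) W = \phi^T Y with a Cholesky factorization via SciPy, which works because the matrix is symmetric positive definite for \lambda > 0:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 200, 50, 1e-2
Phi = rng.normal(size=(n, d))
Y = rng.normal(size=n)

A = Phi.T @ Phi + lam * np.eye(d)     # symmetric positive definite for lambda > 0
c, low = cho_factor(A)                # ~d^3/3 FLOPs for the factorization
W_rls = cho_solve((c, low), Phi.T @ Y)

# Same result as forming the explicit inverse, but cheaper and more stable.
print(np.allclose(W_rls, np.linalg.inv(A) @ Phi.T @ Y))
```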

kernels

compute inner products in the high-dimensional feature space without explicitly forming \phi(x)

K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle

Polynomial kernels of degree 2:

k(x^i, x^j) = (1 + (x^i)^T x^j)^2 = (1 + \langle x^i, x^j \rangle)^2

which takes only O(d) operations, versus the O(d^2) terms of the explicit degree-2 feature map \phi.
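
A quick numerical check (illustrative code, not from the slides) that the degree-2 kernel matches the inner product of an explicit feature map:

```python
import numpy as np

def phi_degree2(x):
    """Explicit degree-2 feature map whose inner product equals (1 + x.y)^2."""
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
d = 5
xi, xj = rng.normal(size=d), rng.normal(size=d)

explicit = phi_degree2(xi) @ phi_degree2(xj)      # O(d^2) features
kernel = (1 + xi @ xj) ** 2                       # O(d) operations
print(np.isclose(explicit, kernel))               # True
```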

degree M polynomial

k(x^i, x^j) = (1 + (x^i)^T x^j)^M

How many operations?

  • naive: \approx d + M ops (d multiplications for the inner product, then M - 1 for the power)
  • improved: \approx d + \log M ops (exponentiation by squaring)