See also: slides 3, slides 4, slides 5

## adding bias in D-dimensions OLS

$X_{n \times (d+1)} = \begin{bmatrix} x_{1}^{T} & 1 \\ \vdots & \vdots \\ x_{n}^{T} & 1 \end{bmatrix}$

and

$W_{(d+1) \times 1} = \begin{bmatrix} w_{1} \\ \vdots \\ w_{d} \\ w_{0} \end{bmatrix}$

Add a new auxiliary dimension to the input data, $x_{d+1} = 1$, so the bias $w_{0}$ is learned as part of $W$.

Solve OLS:

$\min_{W \in \mathbb{R}^{(d+1) \times 1}} \|XW - Y\|_{2}^{2}$

Gradient for $f: \mathbb{R}^{d} \to \mathbb{R}$:

$\nabla_{w} f = \begin{bmatrix} \frac{\partial f}{\partial w_{1}} \\ \vdots \\ \frac{\partial f}{\partial w_{d}} \end{bmatrix}$

Jacobian for $g: \mathbb{R}^{m} \to \mathbb{R}^{n}$:

$\nabla_{w} g = \begin{bmatrix} \frac{\partial g_{1}}{\partial w_{1}} & \cdots & \frac{\partial g_{1}}{\partial w_{m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_{n}}{\partial w_{1}} & \cdots & \frac{\partial g_{n}}{\partial w_{m}} \end{bmatrix}_{n \times m}$

Useful identities:

- for $u, v \in \mathbb{R}^{d}$: $g(u) = u^{T} v \implies \nabla_{u} g = v$ (gradient)
- for $A \in \mathbb{R}^{n \times n}$, $u \in \mathbb{R}^{n}$: $g(u) = u^{T} A u \implies \nabla_{u} g = (A + A^{T}) u$ (Jacobian)

Result:

$W_{LS} = (X^{T} X)^{-1} X^{T} Y$
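The closed form above can be sketched in numpy with synthetic data (the data and true weights here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: n samples in d dimensions, y = 2*x1 - x2 + 3 + small noise
n, d = 100, 2
x = rng.normal(size=(n, d))
y = x @ np.array([2.0, -1.0]) + 3.0 + 0.01 * rng.normal(size=n)

# append the auxiliary all-ones column so the bias w0 is learned jointly
X = np.hstack([x, np.ones((n, 1))])      # shape (n, d+1)

# W_LS = (X^T X)^{-1} X^T Y  (solve the normal equations instead of
# forming an explicit inverse, which is cheaper and more stable)
W_ls = np.linalg.solve(X.T @ X, X.T @ y)  # [w1, ..., wd, w0]
print(W_ls)  # ≈ [2, -1, 3]
```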

## non-linear data

Idea: handle non-linear data by augmenting the input with additional non-linear features (e.g., polynomial terms), then fitting a linear model on the expanded input.

## multivariate polynomials.

Question: what happens in the case of multivariate polynomials?

- Assume $M \gg d$
- Number of terms (monomials): $\approx \left(\frac{M}{d}\right)^{d}$
- `#` of training samples needed $\approx$ `#` of parameters

This is an example of the `curse of dimensionality`.
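The blow-up in the monomial count can be checked numerically; the exact number of monomials of degree at most $M$ in $d$ variables is $\binom{M+d}{d}$, and $(M/d)^{d}$ is a rough lower-bound-style estimate (the values of $d$ and $M$ below are arbitrary):

```python
from math import comb

d, M = 5, 20  # dimension d, polynomial degree M (M >> d)

exact = comb(M + d, d)   # exact count of monomials of degree <= M
approx = (M / d) ** d    # the (M/d)^d estimate from the notes
print(exact, approx)     # both grow exponentially in d
```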

## overfitting.

strategies to avoid overfitting:

- add more training data
- L1 (Lasso) or L2 (Ridge) regularization: add a penalty term to the objective function
    - L1 yields sparse models: penalizing the absolute value of the weights forces some coefficients to become exactly 0, which acts as implicit feature selection (and is robust to outliers). $Loss(w) = Error + \lambda \|w\|_{1}$
    - L2 keeps every feature in the model (better when all features should stay interpretable, or for highly non-linear problems): it does not perform feature selection, since weights are only shrunk toward 0 rather than set exactly to 0 as in L1. $Loss(w) = Error + \lambda \|w\|_{2}^{2}$

- Cross-validation
    - split the data into $k$ folds; train on $k-1$ of them and validate on the held-out fold, rotating through all folds

- early stopping
- dropout (see example)
    - randomly selected neurons are ignored during training ⇒ makes the network less sensitive to any individual weight

This relates to the **sample complexity** of learning multivariate polynomials.

## regularization.

L2 regularization:

$\min_{W \in \mathbb{R}^{d}} \|XW - Y\|_{2}^{2} + \lambda \|W\|_{2}^{2}$

Solving for $W_{RLS}$:

$W_{RLS} = (X^{T} X + \lambda I)^{-1} X^{T} Y$

The inverse exists as long as $\lambda > 0$, since $X^{T} X + \lambda I$ is then positive definite.
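A minimal numpy sketch of the regularized closed form, on made-up data (the true weights and $\lambda$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=n)

lam = 0.5
# W_RLS = (X^T X + lambda I)^{-1} X^T Y ; for lam > 0 the matrix
# X^T X + lam*I is positive definite, so the system is always solvable
W_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(W_rls)  # close to [1, 0, -2], slightly shrunk toward 0
```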

## polynomial curve-fitting revisited

feature map: $\phi(x): \mathbb{R}^{d_{1}} \to \mathbb{R}^{d_{2}}$ where $d_{2} \gg d_{1}$

training:

- $W_{*} = \arg\min_{W} \|\phi W - Y\|_{2}^{2} + \lambda \|W\|_{2}^{2}$
- $W_{*} = (\phi^{T} \phi + \lambda I)^{-1} \phi^{T} Y$

prediction:

- $\hat{y} = \langle W_{*}, \phi(x) \rangle = W_{*}^{T} \phi(x)$

choices of $ϕ(x)$

- Gaussian basis functions: $\phi(x) = \exp\left(-\frac{\|x - \mu\|^{2}}{2\sigma^{2}}\right)$
- Polynomial basis functions: $\phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}$
- Fourier basis functions: DFT, FFT
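The training/prediction recipe above can be sketched with a polynomial basis; the 1-D quadratic ground truth and degree-3 basis here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D inputs with a quadratic ground truth y = 1 + 2x - 3x^2 + noise
x = rng.uniform(-1, 1, size=40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.normal(size=40)

def phi(x, degree=3):
    """Polynomial feature map {1, x, x^2, ..., x^degree}, row per sample."""
    return np.stack([x**k for k in range(degree + 1)], axis=-1)

Phi = phi(x)                                 # (40, 4) design matrix
lam = 1e-6
# training: W* = (Phi^T Phi + lam I)^{-1} Phi^T y
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# prediction: y_hat = <W*, phi(x_new)>
x_new = 0.5
y_hat = phi(np.array([x_new]))[0] @ W
print(y_hat)  # ≈ 1 + 2*0.5 - 3*0.25 = 1.25
```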

## computational complexity

calculate $W_{RLS} = (\phi^{T} \phi + \lambda I)^{-1} \phi^{T} Y$

matmul:

- Naive: $O(d^{3})$
- Strassen's algorithm: $O(d^{2.81})$
- Coppersmith-Winograd: $O(d^{2.376})$

matrix inversion:

- Gaussian elimination: $O(d^{3})$
- Cholesky decomposition: $O(d^{3})$ (around $\frac{1}{3} n^{3}$ FLOPs)
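Since $\phi^{T}\phi + \lambda I$ is symmetric positive definite for $\lambda > 0$, the regularized system can be solved via Cholesky instead of a general inverse; a sketch on random data (in practice a dedicated triangular solver such as `scipy.linalg.solve_triangular` would replace the generic `np.linalg.solve` calls):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 5))
Y = rng.normal(size=30)
lam = 0.1

A = Phi.T @ Phi + lam * np.eye(5)  # symmetric positive definite for lam > 0
b = Phi.T @ Y

# Cholesky factorization A = L L^T, about (1/3) n^3 FLOPs
L = np.linalg.cholesky(A)
# forward then backward substitution: L z = b, then L^T W = z
z = np.linalg.solve(L, b)
W = np.linalg.solve(L.T, z)

print(W)  # same solution as np.linalg.solve(A, b)
```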

## kernels

compute higher dimension inner products

$K(x_{i}, x_{j}) = \langle \phi(x_{i}), \phi(x_{j}) \rangle$

Polynomial kernels of degree 2:

$k(x_{i}, x_{j}) = (1 + x_{i}^{T} x_{j})^{2} = (1 + \langle x_{i}, x_{j} \rangle)^{2}$, computed in $O(d)$ operations.

Degree-$M$ polynomial:

$k(x_{i}, x_{j}) = (1 + x_{i}^{T} x_{j})^{M}$

How many operations?

- improved: $d + \log M$ ops (inner product in $O(d)$, then the $M$-th power via repeated squaring)
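The point of the kernel trick can be verified directly for degree 2: the $O(d)$ kernel value equals the inner product of explicit $O(d^{2})$-dimensional feature vectors. The feature map `phi2` below is one standard choice that expands $(1 + \langle x, y \rangle)^{2} = 1 + 2\langle x, y \rangle + \langle x, y \rangle^{2}$:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
xi, xj = rng.normal(size=d), rng.normal(size=d)

def phi2(x):
    """Explicit feature map whose inner product equals (1 + <x, y>)^2:
    [1, sqrt(2)*x_1..x_d, all pairwise products x_a * x_b]."""
    return np.concatenate([[1.0], np.sqrt(2) * x, np.outer(x, x).ravel()])

explicit = phi2(xi) @ phi2(xj)   # inner product of O(d^2) features
kernel = (1.0 + xi @ xj) ** 2    # same value in O(d) operations
print(explicit, kernel)
```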