Bias and intercept
See also: slides 3, slides 4, slides 5
adding a bias term in d-dimensional OLS
$$X'_{n \times (d+1)} = \begin{pmatrix}
x_1^{1} & \cdots & x_1^{d} & 1 \\
\vdots & \ddots & \vdots & \vdots \\
x_n^{1} & \cdots & x_n^{d} & 1
\end{pmatrix}$$
and
$$W_{(d+1) \times 1} = \begin{pmatrix}
w_1 \\
\vdots \\
w_d \\
w_0
\end{pmatrix}$$
Add a new auxiliary dimension to the input data, $x_{d+1} = 1$, so that the bias $w_0$ is absorbed into the weight vector.
Solve OLS:
$$\min_{W \in \mathbb{R}^{(d+1) \times 1}} \|X'W - Y\|_2^2$$
Gradient for $f: \mathbb{R}^d \rightarrow \mathbb{R}$
$$\triangledown_{w}\, f = \begin{bmatrix}
\frac{\partial f}{\partial w_1} \\
\vdots \\
\frac{\partial f}{\partial w_d}
\end{bmatrix}$$
Jacobian for $g: \mathbb{R}^m \rightarrow \mathbb{R}^n$
$$\triangledown_{w}\, g = \begin{bmatrix}
\frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_1}{\partial w_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_n}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_m}
\end{bmatrix}_{n \times m}$$
Two useful identities (a numerical check is sketched below):
For $u, v \in \mathbb{R}^d$: $g(u) = u^T v \implies \triangledown_{u}\, g = v$ (gradient)
For $A \in \mathbb{R}^{n \times n}$ and $u \in \mathbb{R}^n$: $g(u) = u^T A u \implies \triangledown_{u}\, g = (A + A^T)\, u$ (gradient; the corresponding $1 \times n$ Jacobian is the row vector $u^T(A + A^T)$)
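A quick numerical sanity check of the quadratic-form gradient identity, assuming NumPy (the matrix, vector, and step size below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))      # arbitrary (non-symmetric) matrix
u = rng.standard_normal(d)

g = lambda u: u @ A @ u              # g(u) = u^T A u
analytic = (A + A.T) @ u             # claimed gradient

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (g(u + eps * e) - g(u - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```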
$$W^{\text{LS}} = (X^T X)^{-1} X^T Y$$
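A minimal sketch of OLS with the auxiliary bias column, assuming NumPy (data and names are illustrative); `np.linalg.lstsq` is used rather than forming $(X^T X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.standard_normal((n, d))
Y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.standard_normal(n)

# append the auxiliary dimension x_{d+1} = 1 to absorb the intercept w_0
X_prime = np.hstack([X, np.ones((n, 1))])

# W_LS solves min_W ||X' W - Y||^2, computed via a least-squares solver
W_ls, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)

print(W_ls[:-1])   # ≈ [2.0, -1.0, 0.5]  (weights)
print(W_ls[-1])    # ≈ 4.0               (intercept w_0)
```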
non-linear data
The idea is to fit non-linear data by padding the input with additional feature dimensions.
multivariate polynomials.
Question: what happens in the case of multivariate polynomials?
Assume $M \gg d$
Number of terms (monomials): $\approx (\frac{M}{d})^d$
# of training samples $\approx$ # of parameters
An example of the curse of dimensionality.
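As a concrete illustration (numbers chosen here for illustration, not from the slides): with $d = 10$ input variables and degree $M = 20$, already
$$\left(\tfrac{M}{d}\right)^{d} = 2^{10} = 1024,$$
and the exact count of monomials of degree at most $M$, $\binom{M+d}{d} = \binom{30}{10} \approx 3 \times 10^{7}$, is far larger still, so the number of parameters, and hence the number of training samples needed, explodes even for moderate $d$ and $M$.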
overfitting.
strategies to avoid:
add more training data
L1 (Lasso) or L2 (Ridge) regularization
add a penalty term to the objective function
L1 yields sparse models: the absolute-value penalty on the weights forces some coefficients to become exactly 0 (it is also more robust to outliers).
$$\text{Loss}(w) = \text{Error} + \lambda \| w \|_1$$
L2 does not perform feature selection: weights are only shrunk toward 0 rather than set exactly to 0 as with L1, so all features are kept; it tends to suit fits where many (possibly non-linear) features contribute. A small sketch contrasting the two penalties follows after this list.
$$\text{Loss}(w) = \text{Error} + \lambda \| w \|_2^2$$
Cross-validation
early stopping
dropout, see example
randomly selected neurons are ignored during training ⇒ makes the network less sensitive to any individual neuron
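A minimal sketch contrasting the L1 and L2 penalties, assuming scikit-learn is available (the data and `alpha` values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
true_w = np.zeros(d)
true_w[:3] = [3.0, -2.0, 1.5]          # only 3 of 20 features actually matter
y = X @ true_w + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)     # L2 penalty

print((lasso.coef_ == 0).sum())        # typically most of the 17 irrelevant coefficients: exactly 0 (sparse)
print((ridge.coef_ == 0).sum())        # typically 0: weights are shrunk, not zeroed
```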
sample complexity of learning multivariate polynomials
regularization.
L2 regularization:
$$\min_{W \in \mathbb{R}^{d}} \| XW - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}$$
Solving for $W^{\text{RLS}}$ by setting the gradient of the objective to zero gives:
$$W^{\text{RLS}} = (X^T X + \lambda I)^{-1} X^T Y$$
The inverse exists as long as $\lambda > 0$, since $X^T X + \lambda I$ is then positive definite.
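A minimal sketch of the closed-form ridge solution, assuming NumPy (names and $\lambda$ are illustrative); `np.linalg.solve` is used instead of computing the inverse explicitly:

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """W_RLS = (X^T X + lam * I)^{-1} X^T Y, computed via a linear solve."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
Y = rng.standard_normal(50)
W_rls = ridge_closed_form(X, Y, lam=0.5)
print(W_rls.shape)   # (5,)
```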
polynomial curve-fitting revisited
feature map: $\phi(x): \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ where $d_2 \gg d_1$
training:
$$W^{*} = \arg\min_{W} \| \phi W - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}$$
$$W^{*} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y$$
prediction:
$$\hat{y} = \langle W^{*}, \phi(x) \rangle = {W^{*}}^T \phi(x)$$
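A minimal end-to-end sketch with a hand-rolled polynomial feature map, assuming NumPy (the feature map, data, and $\lambda$ below are illustrative choices, not the course's code):

```python
import numpy as np

def phi(x, degree=5):
    """Polynomial feature map for scalar x: [1, x, x^2, ..., x^degree]."""
    return np.array([x**k for k in range(degree + 1)])

rng = np.random.default_rng(3)
x_train = np.linspace(-1, 1, 30)
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

Phi = np.stack([phi(x) for x in x_train])          # n x d2 design matrix of features
lam = 1e-3
W_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y_train)

# prediction: y_hat = <W*, phi(x)>
x_new = 0.3
y_hat = W_star @ phi(x_new)
print(y_hat, np.sin(3 * x_new))   # prediction vs. noiseless target (should be close)
```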
choices of $\phi(x)$
Gaussian basis functions: $\phi(x) = \exp\left(-\frac{\| x - \mu \|^{2}}{2\sigma^{2}}\right)$ (sketched after this list)
Polynomial basis functions: $\phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}$
Fourier basis functions: DFT, FFT
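A minimal sketch of Gaussian basis features for scalar inputs, with centers $\mu_j$ and width $\sigma$ chosen arbitrarily for illustration:

```python
import numpy as np

def gaussian_features(x, centers, sigma=0.2):
    """phi_j(x) = exp(-|x - mu_j|^2 / (2 sigma^2)) for each center mu_j."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

centers = np.linspace(-1, 1, 10)         # mu_1, ..., mu_10
x = 0.15
print(gaussian_features(x, centers))     # 10-dimensional feature vector for x
```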
computational complexity
calculate $W^{\text{RLS}} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y$
matmul:
Naive: $O(d^3)$
Strassen's algorithm: $O(d^{2.81})$
Coppersmith–Winograd: $O(d^{2.376})$
matrix inversion:
Gaussian elimination: $O(d^3)$
Cholesky decomposition: $O(d^3)$ (around $\frac{1}{3}d^3$ FLOPs)
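Since $\phi^T \phi + \lambda I$ is symmetric positive definite for $\lambda > 0$, the ridge system can be solved via a Cholesky factorization rather than an explicit inverse; a minimal sketch assuming SciPy (illustrative data):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
Phi = rng.standard_normal((100, 8))
Y = rng.standard_normal(100)
lam = 0.5

A = Phi.T @ Phi + lam * np.eye(8)      # symmetric positive definite
b = Phi.T @ Y

c, low = cho_factor(A)                 # ~ (1/3) d^3 FLOPs for the factorization
W_rls = cho_solve((c, low), b)         # two triangular solves, O(d^2)

print(np.allclose(A @ W_rls, b))       # True
```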
kernels
compute inner products in the higher-dimensional feature space without forming $\phi(x)$ explicitly
$$K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$$
Polynomial kernels of degree 2:
$$k(x^i, x^j) = (1 + (x^i)^T x^j)^2 = (1 + \langle x^i, x^j \rangle)^2$$
This needs only $O(d)$ operations (one $d$-dimensional inner product plus a square).
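A quick numerical check that the degree-2 kernel equals the inner product of an explicit degree-2 feature map (the map written out below is one standard choice, included here as an illustration):

```python
import numpy as np
from itertools import combinations

def phi2(x):
    """Explicit feature map whose inner product equals (1 + <x, y>)^2."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(5)
x, y = rng.standard_normal(4), rng.standard_normal(4)

kernel = (1 + x @ y) ** 2          # O(d) operations
explicit = phi2(x) @ phi2(y)       # O(d^2)-dimensional feature vectors

print(np.allclose(kernel, explicit))   # True
```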
For a general degree $M$:
$$k(x^i, x^j) = (1 + (x^i)^T x^j)^M$$
How many operations?
improved: $d + \log M$ ops ($O(d)$ for the inner product, then $O(\log M)$ multiplications via exponentiation by squaring)
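A minimal sketch of that cost breakdown using exponentiation by squaring (illustrative; for floating-point bases Python's built-in `**` is effectively constant time anyway):

```python
import numpy as np

def poly_kernel(x, y, M):
    """(1 + <x, y>)^M in O(d) + O(log M) multiplications."""
    s = 1.0 + float(x @ y)     # O(d): one inner product
    result = 1.0
    while M > 0:               # O(log M): square-and-multiply
        if M & 1:
            result *= s
        s *= s
        M >>= 1
    return result

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, y, 3), (1 + x @ y) ** 3)   # both -0.125
```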