Supervised machine learning See also: book
probability density function
if $X$ is a random variable, the probability density function (pdf) is a function $f(x)$ such that:
$$
P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx
$$
if the distribution of $X$ is uniform over $[a,b]$, then $f(x) = \frac{1}{b-a}$
curve fitting
how do we fit a curve to a distribution of data?
Given a set of $n$ data points $S=\{(x^i, y^i)\}_{i=1}^{n}$
$x \in \mathbb{R}^{d}$
$y \in \mathbb{R}$ (or $\mathbb{R}^{k}$)
In the case of 1-D ordinary least squares, the problem is to find $a, b \in \mathbb{R}$ that solve $\min\limits_{a,b} \sum_{i=1}^{n} (ax^i + b - y^i)^2$
minimize $f(a, b) = \sum^{n}_{i=1}{(ax^i + b - y^i)^2}$
$$
\begin{aligned}
\frac{\partial f}{\partial a} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} x^{i} = 0 \\
\frac{\partial f}{\partial b} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} = 0 \\
\\
\implies 2nb + 2a \sum_{i=1}^{n} x^i &= 2 \sum_{i=1}^{n} y^i \\
\implies b + a \overline{x} &= \overline{y} \\
\implies b &= \overline{y} - a \overline{x} \\
\\
\because \overline{y} &= \frac{1}{n} \sum_{i=1}^{n} y^{i} \\
\overline{x} &= \frac{1}{n} \sum_{i=1}^{n} x^{i}
\end{aligned}
$$
optimal solution
$$
\begin{aligned}
a &= \frac{\overline{xy} - \overline{x} \cdot \overline{y}}{\overline{x^2} - (\overline{x})^2} = \frac{\text{COV}(x,y)}{\text{Var}(x)} \\
b &= \overline{y} - a \overline{x}
\end{aligned}
$$
where $\overline{x} = \frac{1}{n} \sum{x^i}$, $\overline{y} = \frac{1}{n} \sum{y^i}$, $\overline{xy} = \frac{1}{n} \sum{x^i y^i}$, $\overline{x^2} = \frac{1}{n} \sum{(x^i)^2}$
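A minimal numpy sketch of this closed form (function and variable names are mine, not from the notes):

```python
import numpy as np

def ols_1d(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Closed-form 1-D least squares: a = Cov(x, y) / Var(x), b = ybar - a * xbar."""
    x_bar, y_bar = x.mean(), y.mean()
    a = ((x * y).mean() - x_bar * y_bar) / ((x**2).mean() - x_bar**2)
    b = y_bar - a * x_bar
    return a, b

# quick check against a noisy line y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + 0.1 * rng.standard_normal(200)
print(ols_1d(x, y))  # roughly (2.0, 1.0)
```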
hyperplane
$$
\begin{aligned}
\hat{y} &= w_{0} + \sum_{j=1}^{d}{w_j x_j}\\[12pt]
&\because w_0: \text{the y-intercept (bias)}
\end{aligned}
$$
Homogeneous hyperplane:
$$
\begin{aligned}
w_{0} & = 0 \\
\hat{y} &= \sum_{j=1}^{d}{w_j x_j} = \langle{w,x} \rangle = w^Tx
\end{aligned}
$$
Matrix form OLS:
$$
X_{n\times d} = \begin{pmatrix}
x_1^1 & \cdots & x_d^1 \\
\vdots & \ddots & \vdots \\
x_1^n & \cdots & x_d^n
\end{pmatrix}, \quad
Y_{n\times 1} = \begin{pmatrix}
y^1 \\
\vdots \\
y^n
\end{pmatrix}, \quad
W_{d\times 1} = \begin{pmatrix}
w_1 \\
\vdots \\
w_d
\end{pmatrix}
$$
$$
\begin{aligned}
\text{Obj} &: \sum_{i=1}^n (\hat{y}^i - y^i)^2 = \sum_{i=1}^n (\langle w, x^i \rangle - y^i)^2 \\
\\
\text{Def} &:
\Delta = \begin{pmatrix}
\Delta_1 \\
\vdots \\
\Delta_n
\end{pmatrix} = \begin{pmatrix}
x_1^1 & \cdots & x_d^1 \\
\vdots & \ddots & \vdots \\
x_1^n & \cdots & x_d^n
\end{pmatrix} \begin{pmatrix}
w_1 \\
\vdots \\
w_d
\end{pmatrix} - \begin{pmatrix}
y^1 \\
\vdots \\
y^n
\end{pmatrix} = \begin{pmatrix}
\hat{y}^1 - y^1 \\
\vdots \\
\hat{y}^n - y^n
\end{pmatrix}
\end{aligned}
$$
$$
\min\limits_{W \in \mathbb{R}^{d \times 1}} \|XW - Y\|_2^2
$$

$$
W^{\text{LS}} = (X^T X)^{-1} X^T Y
$$
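A small numpy sketch of the matrix form on synthetic data (my own example; solving the normal equations is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.standard_normal((n, d))
true_W = np.array([1.5, -2.0, 0.5])
Y = X @ true_W + 0.05 * rng.standard_normal(n)

# normal equations: W_LS = (X^T X)^{-1} X^T Y, solved without an explicit inverse
W_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(W_ls)  # close to [1.5, -2.0, 0.5]
```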
Example:
$\hat{y} = w_{0} + w_{1} \cdot x_{1} + w_{2} \cdot x_{2}$
With
$$
X_{n \times 2} = \begin{pmatrix}
x^{1}_{1} & x^{1}_{2} \\
x^{2}_{1} & x^{2}_{2} \\
x^{3}_{1} & x^{3}_{2}
\end{pmatrix}
$$
and
$$
X^{'}_{n \times 3} = \begin{pmatrix}
x^{1}_{1} & x^{1}_{2} & 1 \\
x^{2}_{1} & x^{2}_{2} & 1 \\
x^{3}_{1} & x^{3}_{2} & 1
\end{pmatrix}
$$
With
$$
W = \begin{pmatrix}
w_1 \\
w_2
\end{pmatrix}
$$
and
$$
W^{'} = \begin{pmatrix}
w_1 \\
w_2 \\
w_0
\end{pmatrix}
$$
thus
$$
X^{'} W^{'} = \begin{pmatrix}
w_0 + \sum{w_i x_i^{1}} \\
\vdots \\
w_0 + \sum{w_i x_i^{n}}
\end{pmatrix}
$$
adding bias in d-dimensional OLS
$$
X^{'}_{n \times (d+1)} = \begin{pmatrix}
x_1^{1} & \cdots & x_d^{1} & 1 \\
\vdots & \ddots & \vdots & \vdots \\
x_1^{n} & \cdots & x_d^{n} & 1
\end{pmatrix}
$$
and
$$
W_{(d+1) \times 1} = \begin{pmatrix}
w_1 \\
\vdots \\
w_d \\
w_0
\end{pmatrix}
$$
Add a new auxiliary dimension to the input data, $x_{d+1} = 1$
Solve OLS:
$$
\min\limits_{W \in \mathbb{R}^{(d+1) \times 1}} \|X^{'}W - Y\|_2^2
$$
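A sketch of this bias trick on synthetic data (an example of mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 2
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0]) + 3.0 + 0.05 * rng.standard_normal(n)  # intercept 3.0

# append a constant column of ones so w_0 is learned as just another weight
X_aug = np.hstack([X, np.ones((n, 1))])
W = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
print(W)  # approximately [1.0, -1.0, 3.0] -> last entry is the bias w_0
```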
Gradient for $f: \mathbb{R}^d \rightarrow \mathbb{R}$
$$
\triangledown_{w} \space f = \begin{bmatrix}
\frac{\partial f}{\partial w_1} \\
\vdots \\
\frac{\partial f}{\partial w_d}
\end{bmatrix}
$$
Jacobian for $g: \mathbb{R}^m \rightarrow \mathbb{R}^n$
$$
\begin{aligned}
\triangledown_{w} \space g &= \begin{bmatrix}
\frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_1}{\partial w_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_n}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_m}
\end{bmatrix}_{n \times m} \\
\\
&u, v \in \mathbb{R}^d \\
&\because g(u) = u^T v \implies \triangledown_{u} \space g = v \text{ (gradient)} \\
\\
&A \in \mathbb{R}^{n \times n}; \ u \in \mathbb{R}^n \\
&\because g(u) = u^T A u \implies \triangledown_{u} \space g = (A + A^T) u \text{ (gradient; its transpose } u^T(A + A^T) \text{ is the Jacobian)}
\end{aligned}
$$
$$
W^{\text{LS}} = (X^T X)^{-1} X^T Y
$$
overfitting.
strategies to avoid:
add more training data
L1 (Lasso) or L2 (Ridge) regularization
add a penalty term to the objective function
L1 yields sparse models: because the penalty is the absolute value of the weights, it forces some model coefficients to become exactly 0 (it is also more robust to outliers).
$$
\text{Loss}(w) = \text{Error} + \lambda \|w\|_1
$$
L2 is preferable when many (possibly correlated or non-linear) features all carry signal: it does not perform feature selection, since weights are only shrunk toward 0 instead of being set exactly to 0 as with L1.
$$
\text{Loss}(w) = \text{Error} + \lambda \|w\|_2^2
$$
Cross-validation
early stopping
dropout, see example
randomly selected neurons are ignored ⇒ makes network less sensitive
sample complexity of learning multivariate polynomials
regularization.
L2 regularization:
$$
\min_{W \in \mathbb{R}^{d}} \| XW - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}
$$
Solving for $W^{\text{RLS}}$ gives
$$
W^{\text{RLS}} = (X^T X + \lambda I)^{-1} X^T Y
$$
The inverse exists as long as $\lambda > 0$
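A numpy sketch of the ridge closed form (synthetic data and the $\lambda$ value are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 10
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 1.0  # regularization strength
# W_RLS = (X^T X + lam I)^{-1} X^T Y; X^T X + lam I is invertible for lam > 0
W_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(np.linalg.norm(W_rls))  # shrinks toward 0 as lam grows
```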
polynomial curve-fitting revisited
feature map: $\phi(x): \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ where $d_2 \gg d_1$
training:
$$
W^{*} = \min\limits_{W} \| \phi W - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}
$$

$$
W^{*} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y
$$
prediction:
$$
\hat{y} = \langle{W^{*}, \phi{(x)}} \rangle = {W^{*}}^T \phi(x)
$$
choices of $\phi(x)$
Gaussian basis functions: $\phi(x) = \exp\left(-\frac{\| x - \mu \|^{2}}{2\sigma^{2}}\right)$
Polynomial basis functions: $\phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}$ (see the sketch after this list)
Fourier basis functions: DFT, FFT
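A minimal numpy sketch of regression with the polynomial basis above (degree, $\lambda$, and the toy data are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 80)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(80)

degree, lam = 5, 1e-3
Phi = np.vander(x, degree + 1, increasing=True)  # columns [1, x, x^2, ..., x^degree]

# regularized least squares in feature space: W* = (Phi^T Phi + lam I)^{-1} Phi^T y
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

x_new = np.array([0.3])
y_hat = np.vander(x_new, degree + 1, increasing=True) @ W  # prediction <W*, phi(x)>
print(y_hat)
```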
kernels
compute inner products in a higher-dimensional feature space
$$
K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle
$$
Polynomial kernels of degree 2:
$$
k(x^i, x^j) = (1 + (x^i)^T x^j)^2 = (1 + \langle{x^i, x^j} \rangle)^2 \quad \because O(d) \text{ operations}
$$
$$
k(x^i, x^j) = (1 + (x^i)^T x^j)^M
$$
How many operations?
improved: $d + \log M$ ops
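A sketch of where $d + \log M$ comes from: one $d$-dimensional inner product followed by exponentiation by squaring (the helper name is mine):

```python
import numpy as np

def poly_kernel(x: np.ndarray, z: np.ndarray, M: int) -> float:
    """k(x, z) = (1 + <x, z>)^M without ever building the huge explicit feature map."""
    s = 1.0 + float(np.dot(x, z))  # d multiply-adds

    # exponentiation by squaring: O(log M) multiplications
    result, base, m = 1.0, s, M
    while m > 0:
        if m & 1:
            result *= base
        base *= base
        m >>= 1
    return result

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z, 4), (1 + x @ z) ** 4)  # same value
```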
kernel least squares
Steps:
$$
W^{*} = \min\limits_{W} \|\phi W - Y\|_2^2 + \lambda \| W \|_2^2
$$
Show that $\exists \space a \in \mathbb{R}^n \mid W^{*} = \phi^T a$, i.e. $W^{*} = \sum a_i \phi(x^i)$
$$
\begin{aligned}
0 &= \frac{\partial}{\partial W} (\|\phi W - Y\|_2^2 + \lambda \| W \|_2^2) \\
&= 2 W^T (\phi^T \phi) - 2 Y^T \phi + 2 \lambda W^T \\
&\implies \lambda W = \phi^T Y - \phi^T \phi W \\
&\implies W = \phi^T \frac{(Y - \phi W)}{\lambda} = \phi^T a, \quad a \coloneqq \frac{Y - \phi W}{\lambda}
\end{aligned}
$$
Uses $W^{*} = \sum a_i \phi(x^i)$ to form the dual representation of the problem.
$$
\min\limits_{\overrightarrow{a} \in \mathbb{R}^n} \| Ka - Y \|_2^2 + \lambda a^T K a \quad \because \hat{Y} = \phi \phi^T a = K_{n \times n}\, a_{n \times 1}
$$
Solution:
$$
a^{*} = (K + \lambda I)^{-1} Y
$$
choices
polynomial kernel: $K(x, z) = (1 + x^T z)^d$
Gaussian kernel: $K(x, z) = e^{-\frac{\|x-z\|_2^2}{2\sigma^2}} = e^{-\alpha \|x-z\|^2_2}$ (sketch below)
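A short numpy sketch of kernel least squares with the Gaussian kernel (the bandwidth `alpha`, $\lambda$, and toy data are assumptions of mine); prediction uses $\hat{y}(x) = \sum_i a_i K(x^i, x)$, which follows from $W^{*} = \sum_i a_i \phi(x^i)$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, (60, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(60)

alpha, lam = 2.0, 1e-2

def gaussian_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """K[i, j] = exp(-alpha * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * d2)

K = gaussian_kernel(X, X)                         # n x n Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), y)  # a* = (K + lam I)^{-1} Y

X_test = np.array([[0.5]])
y_hat = gaussian_kernel(X_test, X) @ a            # sum_i a_i K(x_test, x^i)
print(y_hat, np.sin(1.0))                         # prediction close to sin(1.0)
```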
mapping high-dimensional data
minimising reconstruction error
Given $X \in \mathbb{R}^{d \times n}$, find $A \in \mathbb{R}^{q \times d}$ (and $B \in \mathbb{R}^{d \times q}$) that minimise the reconstruction error:
$$
\min\limits_{A,B} \sum_{i} \| x^i - B A x^i \|_2^2
$$
if $q=d$, then the error is zero.
Solution:
$B = A^T$
$\min\limits_{A} \sum_i \| x^i - A^T A x^i \|^2$ subject to $A A^T = I_{q \times q}$
assuming the data is centered, i.e. $\frac{1}{n} \sum_{i} x^i = \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix}^T$
eigenvalue decomposition
$$
\begin{aligned}
X^T X u &= \lambda u \\
X^T X &= U^T \Lambda U \\
\\
\because \Lambda &= \text{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_d \end{bmatrix}
\end{aligned}
$$
pca
Idea: given input $x^1, \cdots, x^n \in \mathbb{R}^d$, $\mu = \frac{1}{n} \sum_{i} x^i$
Thus
$$
C = \sum (x^i - \mu)(x^i - \mu)^T
$$
Find the eigenvectors/values of $C$:
$$
C = U^T \Lambda U
$$
Optimal $A$ stacks the eigenvectors of $C$ with the $q$ largest eigenvalues as rows:
$$
A = \begin{bmatrix}
u_1^T \\
u_2^T \\
\vdots \\
u_q^T
\end{bmatrix}
$$
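A compact numpy sketch of this PCA recipe (data and $q$ are arbitrary); note `np.linalg.eigh` returns eigenvalues in ascending order, so the top-$q$ eigenvectors are taken from the end:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, q = 500, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated data, rows = samples

mu = X.mean(axis=0)
Xc = X - mu                              # center the data
C = Xc.T @ Xc                            # covariance (up to a 1/n factor)

eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues, columns = eigenvectors
A = eigvecs[:, -q:][:, ::-1].T           # top-q eigenvectors as rows -> q x d

Z = Xc @ A.T                             # project: q-dimensional codes
X_rec = Z @ A + mu                       # reconstruct with B = A^T
print(np.mean((X - X_rec) ** 2))         # reconstruction error
```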
bayes rules and chain rules
Joint distribution: $P(X,Y)$
Conditional distribution of $X$ given $Y$: $P(X|Y) = \frac{P(X,Y)}{P(Y)}$
Bayes rule: $P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$
Chain rule:
for two events:
$$
P(A, B) = P(B \mid A)P(A)
$$
generalised:
$$
\begin{aligned}
&P(X_1, X_2, \ldots , X_k) \\
&= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1,\dots,X_{j-1}) \\[12pt]
&\because \text{expansion: }P(X_1)P(X_2|X_1)\ldots P(X_k|X_1,X_2,\ldots,X_{k-1})
\end{aligned}
$$
assume an underlying distribution $D$, and that the train and test sets are drawn from it independent and identically distributed (i.i.d.)
Example: flip a coin
Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$, i.e. $x \in \{0,1\}$; $x$ is a Bernoulli random variable.
$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$
a priori and posterior distribution
The maximum likelihood estimate would be
$$
\alpha^{\text{ML}} = \argmax_{\alpha} P(X | \alpha) = \argmin_{\alpha} - \sum_{i} \log (P(x^i | \alpha))
$$
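A quick numerical check of the Bernoulli MLE, which works out to the empirical frequency of $x=0$ (the grid search is only for illustration; names are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
true_alpha = 0.3                                       # P(x = 0)
X = (rng.uniform(size=1000) > true_alpha).astype(int)  # 0 w.p. alpha, 1 otherwise

def neg_log_likelihood(alpha: float, X: np.ndarray) -> float:
    # -sum_i log P(x^i | alpha) with P(0) = alpha, P(1) = 1 - alpha
    n0 = np.sum(X == 0)
    n1 = np.sum(X == 1)
    return -(n0 * np.log(alpha) + n1 * np.log(1 - alpha))

grid = np.linspace(0.01, 0.99, 99)
alpha_ml = grid[np.argmin([neg_log_likelihood(a, X) for a in grid])]
print(alpha_ml, np.mean(X == 0))                       # both near 0.3
```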
maximum a posteriori estimation
$$
\begin{aligned}
\alpha^{\text{MAP}} &= \argmax_{\alpha} P(\alpha | X) \\
&= \argmax_{\alpha} \frac{P(X|\alpha)P(\alpha)}{P(X)} \\
&= \argmin_{\alpha}\left[-\log P(\alpha) - \sum_{i=1}^{n} \log P(x^i | \alpha)\right]
\end{aligned}
$$
$$
\begin{aligned}
\argmax_{W} P(Z \mid W) P(W) &= \argmax_{W} \left[\log P(W) + \sum_{i} \log P(x^i, y^i \mid W)\right] \\
&= \argmax_{W} \left[\ln \frac{1}{\beta} - \lambda {\parallel W \parallel}_{2}^{2} - \sum_{i}\frac{({x^i}^T W - y^i)^2}{\sigma^2}\right]
\end{aligned}
$$
$P(W) = \frac{1}{\beta} e^{-\lambda \parallel W \parallel_{2}^{2}}$
$P(W) = \frac{1}{\beta} e^{-\frac{\lambda \parallel W \parallel_{2}^{2}}{r^2}}$
$$
\argmax_{W} P(Z \mid W) = \argmax_{W} \sum \log P(x^i, y^i \mid W)
$$
$$
P(y | x, W) = \frac{1}{\gamma} e^{-\frac{(x^T W-y)^2}{2 \sigma^2}}
$$
expected error minimisation
think of it as bias-variance tradeoff
Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$
solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y | X=x]$
Instead we have $Z = \{(x^i, y^i)\}^n_{i=1}$
error decomposition
$$
\begin{aligned}
&E_{x,y}(y-\hat{y_Z}(x))^2 \\
&= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y_Z}(x))^2 \\
&= \text{noise} + \text{estimation error}
\end{aligned}
$$
bias-variance decompositions
For linear estimator:
$$
\begin{aligned}
E_Z&E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Zx))^2 \\
=& E_{x,y}(y-y^{*}(x))^2 \quad \text{noise} \\
&+ E_x(y^{*}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{bias} \\
&+ E_xE_Z(\hat{y_Z}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{variance}
\end{aligned}
$$
See also: slides 13 , slides 14 , slides 15
accuracy
zero-one loss:
$$
l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}}= \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}
$$
linear classifier
$$
\begin{aligned}
\hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt]
&\because \hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W)
\end{aligned}
$$
surrogate loss functions
assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$
What if classifier's output is continuous?
$\hat{y}$ will also capture the "confidence" of the classifier.
Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.
linearly separable data
A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:
$$
\forall i \in [n] \mid \text{SGN}(\langle x^i, W^{*} \rangle) = y^i
$$
Or, for every $i \in [n]$ we have $(W^{* T}x^i)y^i > 0$
linear programming
$$
\begin{aligned}
\max_{W \in \mathbb{R}^d} &\langle{u, w} \rangle = \sum_{i=1}^{d} u_i w_i \\
&\text{s.t } A w \ge v
\end{aligned}
$$
Given that data is linearly separable
$$
\begin{aligned}
\exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i > 0 \\
\exists \space W^{*}, \gamma > 0 &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge \gamma \\
\exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge 1
\end{aligned}
$$
LP for linear classification
perceptron
Rosenblatt’s perceptron algorithm
Algorithm 5 Batch Perceptron
Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$
1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3: if $\exists \space i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
4: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
5: else
6: output $\mathbf{w}^{(t)}$
7: end if
8: end for
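A direct Python transcription of the listing above, assuming labels $y_i \in \{-1,+1\}$ and linearly separable data so the loop terminates (per the mistake bound below):

```python
import numpy as np

def batch_perceptron(X: np.ndarray, y: np.ndarray, max_iters: int = 10_000) -> np.ndarray:
    """X: m x d data, y: labels in {-1, +1}. Returns a separating w if one is found."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        margins = y * (X @ w)
        mistakes = np.where(margins <= 0)[0]
        if len(mistakes) == 0:
            return w                  # all points correctly classified
        i = mistakes[0]               # any violated example works
        w = w + y[i] * X[i]           # perceptron update
    return w

# toy separable data
rng = np.random.default_rng(8)
X = rng.standard_normal((100, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
w = batch_perceptron(X, y)
print(np.all(np.sign(X @ w) == y))    # True
```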
greedy update
$$
\begin{aligned}
W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i, x^i \rangle y^i \\
&=W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 y^{i} y^{i}
\end{aligned}
$$
proof
See also ( Novikoff, 1962 )
Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$ and $\exists \space \gamma > 0$ s.t.
$$
y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma
$$
Assumption: $\forall \space t= 1 \ldots n, \|\underline{x_t}\| \le R$
Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors
proof by induction
definition of $\underline{\theta^k}$
to be the parameter vector when the algorithm makes the $k^{\text{th}}$ error.
Note that we have $\underline{\theta^{1}}=\underline{0}$
Assume that the $k^{\text{th}}$ error is made on example $t$, or
$$
\begin{align}
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\
&= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\
&\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma \\[12pt]
&\because \text{ Assumption: } y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma
\end{align}
$$
It follows by induction on $k$ that
$$
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma
$$
Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$
$$
\begin{align}
\|\underline{\theta^{k+1}}\| &\ge k \gamma \\[16pt]
&\because \|\underline{\theta^{*}}\| = 1
\end{align}
$$
In the second part, we will find an upper bound for (5):
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\
&= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\
&\le \|\underline{\theta^k}\|^2 + R^2
\end{align}
$$
(9) is due to:
$y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (and $y_t^2 = 1$)
$y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$ given that the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example.
It follows by induction on $k$ that
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 \le kR^2
\end{align}
$$
Combining (5) and (10) gives us
$$
\begin{aligned}
k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\
k &\le \frac{R^2}{\gamma^2}
\end{aligned}
$$
linear algebra review.
Diagonal matrix: every entry except the diagonal is zero.
$$
A = \begin{bmatrix} a_{1} & 0 & \cdots & 0 \\
0 & a_{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_{n} \end{bmatrix}
$$
trace: sum of the entries on the main diagonal: $\text{tr}(A) = \sum_{i=1}^{n} a_{ii}$
Properties of transpose:
$$
\begin{aligned}
(A^T)^T &= A \\
(A + B)^T &= A^T + B^T \\
(AB)^T &= B^T A^T
\end{aligned}
$$
Properties of inverse:
$$
\begin{aligned}
(A^{-1})^{-1} &= A \\
(AB)^{-1} &= B^{-1} A^{-1} \\
(A^T)^{-1} &= (A^{-1})^T
\end{aligned}
$$
if the matrix $A^{-1}$ exists, then $A$ is invertible (non-singular), and vice versa.
Given a square matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is defined as $x^TAx \in \mathbb{R}$:
$$
x^TAx = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j
$$
norms
A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is a norm if it satisfies the following properties:
non-negativity: $\forall x \in \mathbb{R}^n, f(x) \geq 0$
definiteness: $f(x) = 0 \iff x=0$
homogeneity: $\forall x \in \mathbb{R}^n, t\in \mathbb{R}, f(tx) = |t| f(x)$
triangle inequality: $\forall x, y \in \mathbb{R}^n, f(x+y) \leq f(x) + f(y)$
symmetry
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$ ($A \in \mathbb{S}^n$)
anti-symmetric (skew-symmetric) if $A = -A^T$
Given any square matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric, and $A - A^T$ is anti-symmetric.
$$
A = \frac{1}{2}(A+A^T) + \frac{1}{2}(A-A^T)
$$
$A$ is positive definite if $x^TAx > 0$ for all non-zero $x \in \mathbb{R}^n$.
It is denoted by $A \succ 0$.
The set of all positive definite matrices is denoted by $\mathbb{S}^n_{++}$
$A$ is positive semi-definite if $x^TAx \geq 0 \ \forall x \in \mathbb{R}^n$.
It is denoted by $A \succeq 0$.
The set of all positive semi-definite matrices is denoted by $\mathbb{S}^n_{+}$
$A$ is negative definite if $x^TAx < 0$ for all non-zero $x \in \mathbb{R}^n$.
It is denoted by $A \prec 0$.
The set of all negative definite matrices is denoted by $\mathbb{S}^n_{--}$
$A$ is negative semi-definite if $x^TAx \leq 0 \ \forall x \in \mathbb{R}^n$.
It is denoted by $A \preceq 0$.
The set of all negative semi-definite matrices is denoted by $\mathbb{S}^n_{-}$
A symmetric matrix $A \in \mathbb{S}^n$ is indefinite if it is neither positive semi-definite nor negative semi-definite:
$$
\exists x_1, x_2 \in \mathbb{R}^n \space \mid \space x_1^TAx_1 > 0 \space \text{and} \space x_2^TAx_2 < 0
$$
Given any matrix $A \in \mathbb{R}^{m \times n}$, the matrix $G = A^TA$ is always positive semi-definite (known as the Gram matrix)
Proof: $x^TGx = x^TA^TAx = (Ax)^T(Ax) = \|Ax\|_2^2 \geq 0$
eigenvalues and eigenvectors
The non-zero vector $x \in \mathbb{C}^n$ is an eigenvector of $A$ and $\lambda \in \mathbb{C}$ is called the eigenvalue of $A$ if:
$$
Ax = \lambda x
$$
$$
\begin{aligned}
\exists \text{ non-zero eigenvector } x \in \mathbb{C}^n & \iff \text{ null space of } (A - \lambda I) \text{ is non-trivial} \\
& \iff (A - \lambda I) \text{ is singular} \\
& \iff \mid A - \lambda I \mid = 0
\end{aligned}
$$
Solving for the eigenvectors via $(A-\lambda_{i}I)x_i=0$
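A quick numpy check of these definitions (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

# verify A x = lambda x for each pair, and that det(A - lambda I) = 0
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x),
          np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))
```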
See also matrix cookbook
matrix representation of a system of linear equations
$$
\begin{aligned}
x_1 + x_2 + x_3 &= 5 \\
x_1 - 2x_2 - 3x_3 &= -1 \\
2x_1 + x_2 - x_3 &= 3
\end{aligned}
$$
Equivalent matrix representation $Ax = b$:
$$
\begin{aligned}
A &= \begin{bmatrix}
1 & 1 & 1 \\
1 & -2 & -3 \\
2 & 1 & -1
\end{bmatrix} \\
x &= \begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix} \\
b &= \begin{bmatrix}
5 \\
-1 \\
3
\end{bmatrix}
\end{aligned}
$$
where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$
$A \in \mathbb{R}^{m \times n}$ and $A^T \in \mathbb{R}^{n \times m}$
dot product.
$$
\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i
$$
linear combination of columns
Let $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$; then $Ax \in \mathbb{R}^m$
Then $Ax = \sum_{i=1}^{n} a_i x_i \in \mathbb{R}^m$, where $a_i$ is the $i$-th column of $A$
inverse of a matrix
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is a unique matrix denoted by $A^{-1} \in \mathbb{R}^{n\times{n}}$
$$
A^{-1} A = I = A A^{-1}
$$
euclidean norm
$L_{2}$ norm:
$$
\| x \|_{2} = \sqrt{\sum_{i=1}^{n}{x_i^2}} = \sqrt{x^Tx}
$$
L1 norm: $\| x \|_{1} = \sum_{i=1}^{n}{|x_i|}$
$L_{\infty}$ norm: $\| x \|_{\infty} = \max_{i}{|x_i|}$
p-norm: $\| x \|_{p} = (\sum_{i=1}^{n}{|x_i|^p})^{1/p}$
$\|x\|_{\infty} \leq \|x\|_{2} \leq \|x\|_{1}$
One can prove this with Cauchy-Schwarz inequality
linear dependence of vectors
Given $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$ and $\alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}$:
the set is linearly independent if no vector can be written as a linear combination of the others, i.e.
$$
\forall i \in [n], \forall \{a_j\}_{j \neq i} \subseteq \mathbb{R} \quad x_i \neq \sum_{j \neq i}{a_j x_j}
$$
Span
Given a set of vectors $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$, the span of the set is the set of all possible linear combinations of the vectors.
$$
\text{span}(\{x_1, x_2, \ldots, x_n\}) = \{ y: y = \sum_{i=1}^{n}{\alpha_i x_i} \mid \alpha_i \in \mathbb{R} \}
$$
If $x_{1}, x_{2}, \ldots, x_{n}$ are $n = d$ linearly independent vectors, then their span is the entire space $\mathbb{R}^d$
Rank
For a matrix $A \in \mathbb{R}^{m \times n}$:
column rank: max number of linearly independent columns of $A$
row rank: max number of linearly independent rows of $A$
If $\text{rank}(A) = m$, then the rows are linearly independent. If $\text{rank}(A) = n$, then the columns are linearly independent.
the rank of a matrix $A$ is the number of linearly independent columns of $A$:
if $A$ is full rank, then $\text{rank}(A) = \min(m, n)$ (in general $\text{rank}(A) \leq \min(m, n)$)
$\text{rank}(A) = \text{rank}(A^T)$
solving linear system of equations
If $A \in \mathbb{R}^{n \times n}$ is invertible, there exists a unique solution:
$$
x = A^{-1}b
$$
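Numerically one usually solves the system directly instead of forming $A^{-1}$; a sketch using the example system above:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, -2.0, -3.0],
              [2.0, 1.0, -1.0]])
b = np.array([5.0, -1.0, 3.0])

x = np.linalg.solve(A, b)         # preferred over np.linalg.inv(A) @ b
print(x, np.allclose(A @ x, b))   # solution satisfies Ax = b
```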
Range and Projection
Given a matrix $A \in \mathbb{R}^{m \times n}$, the range of $A$, denoted by $\mathcal{R}(A)$, is the span of the columns of $A$:
$$
\mathcal{R}(A) = \{ y \in \mathbb{R}^m \mid y = Ax, \; x \in \mathbb{R}^n \}
$$
Projection of a vector $y \in \mathbb{R}^m$ onto $\text{span}(\{x_1, \cdots, x_n\})$, $x_i \in \mathbb{R}^m$, is the vector in the span that is as close as possible to $y$ w.r.t. the $\ell_2$ norm
$$
\text{Proj}(y; \{x_{1}, \cdots, x_n\}) = \argmin_{{v \in \text{span}(\{x_1, \cdots, x_n\})}} \| y - v \|_2
$$
Null space of A A A
is the set of all vectors that satisfy the following:
$$
\mathcal{N}(A) = \{ x \in \mathbb{R}^n \mid Ax = 0 \}
$$
probability theory
With Bayes rules we have
$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$
Chain rule states for events $A_1, \ldots, A_n$:
$$
\begin{aligned}
P(A_1 \cap A_2 \cap \ldots \cap A_n) &= P(A_n|A_{n-1} \cap \ldots \cap A_1)P(A_{n-1} \cap \ldots \cap A_1) \\
&= P(A_1) \prod_{i=2}^{n} P(A_i|\cap_{j=1}^{i-1} A_j)
\end{aligned}
$$
If $B_{1}, \ldots , B_{n}$ form a finite partition of the sample space, i.e. $\forall i \neq j, B_i \cap B_j = \emptyset \land \cup_{i=1}^{n} B_i = \Omega$, then the law of total probability states that for an event $A$:
$$
P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)
$$
cumulative distribution function
For a random variable $X$, a CDF $F_X(x): \mathbb{R} \rightarrow [0,1]$ is defined as:
$$
F_X(x) \coloneqq P(X \leq x)
$$
$0 \leq F_X(x) \leq 1$
$P(a \leq X \leq b) = F_X(b) - F_X(a)$
probability mass function
for a discrete random variable $X$, the probability mass function $p_X(x) : \mathbb{R} \rightarrow [0,1]$ is defined as:
$$
p_X(x) \coloneqq P(X=x)
$$
$0 \leq p_X(x) \leq 1$
$\sum_{x \in \mathbb{D}} p_X(x) = 1$, where $\mathbb{D}$ is the set of all possible values of $X$
$P(X \in A) = P(\{\omega: X(\omega) \in A\}) = \sum_{x \in A} p_X(x)$
probability density function
for a continuous random variable $X$, the probability density function $f_X(x) : \mathbb{R} \rightarrow [0, \infty)$ is defined as:
$$
f_X(x) \coloneqq \frac{d F_X(x)}{dx}
$$
$f_X(x) \geq 0$
$F_X(x) = \int_{-\infty}^{x}f_X(t)\,dt$
Expectation
for a discrete random variable with PMF $p_X(x)$ and $g(x): \mathbb{R} \rightarrow \mathbb{R}$, the expectation of $g(x)$ is:
$$
\mathbb{E}[g(X)] = \sum_{x \in \mathbb{D}} g(x) p_X(x)
$$
for a continuous random variable with PDF $f_X(x)$, the expectation of $g(x)$ is:
$$
\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx
$$
Therefore, the mean of a random variable $X$ is $\mathbb{E}[X]$:
$$
\mu = \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) dx
$$
Variance of a random variable $X$ is:
$$
\sigma^2 = \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2
$$
$\text{Var}(f(X)+c)=\text{Var}(f(X))$
$\text{Var}(cf(X)) = c^2 \text{Var}(f(X))$
discrete random variables
Bernoulli distribution: $X \sim \text{Bernoulli}(p), 0 \le p \le 1$
$$
\begin{aligned}
p_X(x) &= \begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases} \\
\\
\mathbb{E}[X] &= p \\
\text{Var}(X) &= p(1-p)
\end{aligned}
$$
Binomial distribution: $X \sim \text{Binomial}(n,p), 0 \le p \le 1$
$$
\begin{aligned}
p_X(x) &= \binom{n}{x} p^x (1-p)^{n-x} \\
\\
\because \binom{n}{x} &= \frac{n!}{x!(n-x)!} \\
\mathbb{E}[X] &= np \\
\text{Var}(X) &= np(1-p)
\end{aligned}
$$
Poisson distribution: $X \sim \text{Poisson}(\lambda), \lambda > 0$
$$
\begin{aligned}
p_X(x) &= \frac{e^{-\lambda} \lambda^x}{x!} \\
\mathbb{E}[X] &= \lambda \\
\text{Var}(X) &= \lambda
\end{aligned}
$$
continuous random variables
Uniform distribution: $X \sim \text{Unif}(a,b), a \le b$
$$
\begin{aligned}
f_X(x) &= \begin{cases} \frac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \\
\\
\mathbb{E}[X] &= \frac{a+b}{2} \\
\text{Var}(X) &= \frac{(b-a)^2}{12}
\end{aligned}
$$
Exponential distribution: $X \sim \text{Exp}(\lambda), \lambda > 0$
$$
\begin{aligned}
f_X(x) &= \lambda e^{-\lambda x} \\
\\
\mathbb{E}[X] &= \frac{1}{\lambda} \\
\text{Var}(X) &= \frac{1}{\lambda^2}
\end{aligned}
$$
Gaussian distribution: $X \sim \mathcal{N}(\mu, \sigma^2), -\infty < \mu < \infty, \sigma^2 > 0$
$$
\begin{aligned}
f_X(x) &= \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\\
\mathbb{E}[X] &= \mu \\
\text{Var}(X) &= \sigma^2
\end{aligned}
$$