Supervised machine learning See also: book
probability density function
if $X$ is a random variable, the probability density function (pdf) is a function $f(x)$ such that:
$$
P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx
$$
if the distribution of $X$ is uniform over $[a,b]$, then $f(x) = \frac{1}{b-a}$
curve fitting
how do we fit a curve to a distribution of data?
Given a set of $n$ data points $S=\{(x^i, y^i)\}_{i=1}^{n}$
$x \in \mathbb{R}^{d}$
$y \in \mathbb{R}$ (or $\mathbb{R}^{k}$)
In the case of 1-D ordinary least squares, the problem is to find $a, b \in \mathbb{R}$ that solve $\min\limits_{a,b} \sum_{i=1}^{n} (ax^i + b - y^i)^2$
minimize $f(a, b) = \sum^{n}_{i=1}{(ax^i + b - y^i)^2}$
$$
\begin{aligned}
\frac{\partial f}{\partial a} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} x^{i} = 0 \\
\frac{\partial f}{\partial b} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} = 0 \\
\\
\implies 2nb + 2a \sum_{i=1}^{n} x^i &= 2 \sum_{i=1}^{n} y^i \\
\implies b + a \overline{x} &= \overline{y} \\
\implies b &= \overline{y} - a \overline{x} \\
\\
\because \overline{y} &= \frac{1}{n} \sum_{i=1}^{n} y^{i} \\
\overline{x} &= \frac{1}{n} \sum_{i=1}^{n} x^{i}
\end{aligned}
$$
optimal solution
$$
\begin{aligned}
a &= \frac{\overline{xy} - \overline{x} \cdot \overline{y}}{\overline{x^2} - (\overline{x})^2} = \frac{\text{COV}(x,y)}{\text{Var}(x)} \\
b &= \overline{y} - a \overline{x}
\end{aligned}
$$
where $\overline{x} = \frac{1}{n} \sum{x^i}$, $\overline{y} = \frac{1}{n} \sum{y^i}$, $\overline{xy} = \frac{1}{n} \sum{x^i y^i}$, $\overline{x^2} = \frac{1}{n} \sum{(x^i)^2}$
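A minimal numpy sketch of this closed form (function and variable names are mine, not from the notes):

```python
import numpy as np

def ols_1d(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Closed-form 1-D least squares: a = Cov(x, y) / Var(x), b = ybar - a * xbar."""
    x_bar, y_bar = x.mean(), y.mean()
    a = ((x * y).mean() - x_bar * y_bar) / ((x**2).mean() - x_bar**2)
    b = y_bar - a * x_bar
    return a, b

# quick check against a noisy line y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + 0.1 * rng.standard_normal(200)
print(ols_1d(x, y))  # roughly (2.0, 1.0)
```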
hyperplane
$$
\begin{aligned}
\hat{y} &= w_{0} + \sum_{j=1}^{d}{w_j x_j}\\[12pt]
&\because w_0: \text{the y-intercept (bias)}
\end{aligned}
$$
Homogeneous hyperplane:
$$
\begin{aligned}
w_{0} & = 0 \\
\hat{y} &= \sum_{j=1}^{d}{w_j x_j} = \langle{w,x} \rangle = w^Tx
\end{aligned}
$$
Matrix form OLS:
$$
X_{n\times d} = \begin{pmatrix}
x_1^1 & \cdots & x_d^1 \\
\vdots & \ddots & \vdots \\
x_1^n & \cdots & x_d^n
\end{pmatrix}, \quad
Y_{n\times 1} = \begin{pmatrix}
y^1 \\
\vdots \\
y^n
\end{pmatrix}, \quad
W_{d\times 1} = \begin{pmatrix}
w_1 \\
\vdots \\
w_d
\end{pmatrix}
$$
$$
\begin{aligned}
\text{Obj} &: \sum_{i=1}^n (\hat{y}^i - y^i)^2 = \sum_{i=1}^n (\langle w, x^i \rangle - y^i)^2 \\
\\
\text{Def} &:
\Delta = \begin{pmatrix}
\Delta_1 \\
\vdots \\
\Delta_n
\end{pmatrix} = \begin{pmatrix}
x_1^1 & \cdots & x_d^1 \\
\vdots & \ddots & \vdots \\
x_1^n & \cdots & x_d^n
\end{pmatrix} \begin{pmatrix}
w_1 \\
\vdots \\
w_d
\end{pmatrix} - \begin{pmatrix}
y^1 \\
\vdots \\
y^n
\end{pmatrix} = \begin{pmatrix}
\hat{y}^1 - y^1 \\
\vdots \\
\hat{y}^n - y^n
\end{pmatrix}
\end{aligned}
$$
$$
\min\limits_{W \in \mathbb{R}^{d \times 1}} \|XW - Y\|_2^2
$$

$$
W^{\text{LS}} = (X^T X)^{-1} X^T Y
$$
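A small numpy sketch of the matrix form on synthetic data (my own example; solving the normal equations is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.standard_normal((n, d))
true_W = np.array([1.5, -2.0, 0.5])
Y = X @ true_W + 0.05 * rng.standard_normal(n)

# normal equations: W_LS = (X^T X)^{-1} X^T Y, solved without an explicit inverse
W_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(W_ls)  # close to [1.5, -2.0, 0.5]
```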
Example:
$\hat{y} = w_{0} + w_{1} \cdot x_{1} + w_{2} \cdot x_{2}$
With
$$
X_{n \times 2} = \begin{pmatrix}
x^{1}_{1} & x^{1}_{2} \\
x^{2}_{1} & x^{2}_{2} \\
x^{3}_{1} & x^{3}_{2}
\end{pmatrix}
$$
and
$$
X^{'}_{n \times 3} = \begin{pmatrix}
x^{1}_{1} & x^{1}_{2} & 1 \\
x^{2}_{1} & x^{2}_{2} & 1 \\
x^{3}_{1} & x^{3}_{2} & 1
\end{pmatrix}
$$
With
$$
W = \begin{pmatrix}
w_1 \\
w_2
\end{pmatrix}
$$
and
$$
W^{'} = \begin{pmatrix}
w_1 \\
w_2 \\
w_0
\end{pmatrix}
$$
thus
$$
X^{'} W^{'} = \begin{pmatrix}
w_0 + \sum{w_i x_i^{1}} \\
\vdots \\
w_0 + \sum{w_i x_i^{n}}
\end{pmatrix}
$$
adding bias in d-dimensional OLS
$$
X^{'}_{n \times (d+1)} = \begin{pmatrix}
x_1^{1} & \cdots & x_d^{1} & 1 \\
\vdots & \ddots & \vdots & \vdots \\
x_1^{n} & \cdots & x_d^{n} & 1
\end{pmatrix}
$$
and
$$
W_{(d+1) \times 1} = \begin{pmatrix}
w_1 \\
\vdots \\
w_d \\
w_0
\end{pmatrix}
$$
Add a new auxiliary dimension to the input data, $x_{d+1} = 1$
Solve OLS:
$$
\min\limits_{W \in \mathbb{R}^{(d+1) \times 1}} \|X^{'}W - Y\|_2^2
$$
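A sketch of this bias trick on synthetic data (an example of mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 2
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0]) + 3.0 + 0.05 * rng.standard_normal(n)  # intercept 3.0

# append a constant column of ones so w_0 is learned as just another weight
X_aug = np.hstack([X, np.ones((n, 1))])
W = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
print(W)  # approximately [1.0, -1.0, 3.0] -> last entry is the bias w_0
```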
Gradient for $f: \mathbb{R}^d \rightarrow \mathbb{R}$
$$
\triangledown_{w} \space f = \begin{bmatrix}
\frac{\partial f}{\partial w_1} \\
\vdots \\
\frac{\partial f}{\partial w_d}
\end{bmatrix}
$$
Jacobian for $g: \mathbb{R}^m \rightarrow \mathbb{R}^n$
$$
\begin{aligned}
\triangledown_{w} \space g &= \begin{bmatrix}
\frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_1}{\partial w_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_n}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_m}
\end{bmatrix}_{n \times m} \\
\\
&u, v \in \mathbb{R}^d \\
&\because g(u) = u^T v \implies \triangledown_{u} \space g = v \text{ (gradient)} \\
\\
&A \in \mathbb{R}^{n \times n}; \ u \in \mathbb{R}^n \\
&\because g(u) = u^T A u \implies \triangledown_{u} \space g = (A + A^T) u \text{ (gradient; its transpose } u^T(A + A^T) \text{ is the Jacobian)}
\end{aligned}
$$
$$
W^{\text{LS}} = (X^T X)^{-1} X^T Y
$$
overfitting.
strategies to avoid:
add more training data
L1 (Lasso) or L2 (Ridge) regularization
add a penalty term to the objective function
L1 yields sparse models: because the penalty is the absolute value of the weights, it forces some model coefficients to become exactly 0 (it is also more robust to outliers).
$$
\text{Loss}(w) = \text{Error} + \lambda \|w\|_1
$$
L2 is preferable when many (possibly correlated or non-linear) features all carry signal: it does not perform feature selection, since weights are only shrunk toward 0 instead of being set exactly to 0 as with L1.
$$
\text{Loss}(w) = \text{Error} + \lambda \|w\|_2^2
$$
Cross-validation
early stopping
dropout, see example
randomly selected neurons are ignored ⇒ makes network less sensitive
sample complexity of learning multivariate polynomials
regularization.
L2 regularization:
$$
\min_{W \in \mathbb{R}^{d}} \| XW - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}
$$
Solving for $W^{\text{RLS}}$ gives
$$
W^{\text{RLS}} = (X^T X + \lambda I)^{-1} X^T Y
$$
The inverse exists as long as $\lambda > 0$
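A numpy sketch of the ridge closed form (synthetic data and the $\lambda$ value are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 10
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 1.0  # regularization strength
# W_RLS = (X^T X + lam I)^{-1} X^T Y; X^T X + lam I is invertible for lam > 0
W_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(np.linalg.norm(W_rls))  # shrinks toward 0 as lam grows
```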
polynomial curve-fitting revisited
feature map: $\phi(x): \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ where $d_2 \gg d_1$
training:
$$
W^{*} = \min\limits_{W} \| \phi W - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}
$$

$$
W^{*} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y
$$
prediction:
$$
\hat{y} = \langle{W^{*}, \phi{(x)}} \rangle = {W^{*}}^T \phi(x)
$$
choices of $\phi(x)$
Gaussian basis functions: $\phi(x) = \exp\left(-\frac{\| x - \mu \|^{2}}{2\sigma^{2}}\right)$
Polynomial basis functions: $\phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}$ (see the sketch after this list)
Fourier basis functions: DFT, FFT
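A minimal numpy sketch of regression with the polynomial basis above (degree, $\lambda$, and the toy data are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 80)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(80)

degree, lam = 5, 1e-3
Phi = np.vander(x, degree + 1, increasing=True)  # columns [1, x, x^2, ..., x^degree]

# regularized least squares in feature space: W* = (Phi^T Phi + lam I)^{-1} Phi^T y
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

x_new = np.array([0.3])
y_hat = np.vander(x_new, degree + 1, increasing=True) @ W  # prediction <W*, phi(x)>
print(y_hat)
```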
kernels
compute inner products in a higher-dimensional feature space
$$
K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle
$$
Polynomial kernels of degree 2:
$$
k(x^i, x^j) = (1 + (x^i)^T x^j)^2 = (1 + \langle{x^i, x^j} \rangle)^2 \quad \because O(d) \text{ operations}
$$
$$
k(x^i, x^j) = (1 + (x^i)^T x^j)^M
$$
How many operations?
improved: $d + \log M$ ops
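A sketch of where $d + \log M$ comes from: one $d$-dimensional inner product followed by exponentiation by squaring (the helper name is mine):

```python
import numpy as np

def poly_kernel(x: np.ndarray, z: np.ndarray, M: int) -> float:
    """k(x, z) = (1 + <x, z>)^M without ever building the huge explicit feature map."""
    s = 1.0 + float(np.dot(x, z))  # d multiply-adds

    # exponentiation by squaring: O(log M) multiplications
    result, base, m = 1.0, s, M
    while m > 0:
        if m & 1:
            result *= base
        base *= base
        m >>= 1
    return result

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z, 4), (1 + x @ z) ** 4)  # same value
```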
kernel least squares
Steps:
$$
W^{*} = \min\limits_{W} \|\phi W - Y\|_2^2 + \lambda \| W \|_2^2
$$
Show that $\exists \space a \in \mathbb{R}^n \mid W^{*} = \phi^T a$, i.e. $W^{*} = \sum a_i \phi(x^i)$
$$
\begin{aligned}
0 &= \frac{\partial}{\partial W} (\|\phi W - Y\|_2^2 + \lambda \| W \|_2^2) \\
&= 2 W^T (\phi^T \phi) - 2 Y^T \phi + 2 \lambda W^T \\
&\implies \lambda W = \phi^T Y - \phi^T \phi W \\
&\implies W = \phi^T \frac{(Y - \phi W)}{\lambda} = \phi^T a, \quad a \coloneqq \frac{Y - \phi W}{\lambda}
\end{aligned}
$$
Uses $W^{*} = \sum a_i \phi(x^i)$ to form the dual representation of the problem.
$$
\min\limits_{\overrightarrow{a} \in \mathbb{R}^n} \| Ka - Y \|_2^2 + \lambda a^T K a \quad \because \hat{Y} = \phi \phi^T a = K_{n \times n}\, a_{n \times 1}
$$
Solution:
$$
a^{*} = (K + \lambda I)^{-1} Y
$$
choices
polynomial kernel: $K(x, z) = (1 + x^T z)^d$
Gaussian kernel: $K(x, z) = e^{-\frac{\|x-z\|_2^2}{2\sigma^2}} = e^{-\alpha \|x-z\|^2_2}$ (sketch below)
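A short numpy sketch of kernel least squares with the Gaussian kernel (the bandwidth `alpha`, $\lambda$, and toy data are assumptions of mine); prediction uses $\hat{y}(x) = \sum_i a_i K(x^i, x)$, which follows from $W^{*} = \sum_i a_i \phi(x^i)$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, (60, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(60)

alpha, lam = 2.0, 1e-2

def gaussian_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """K[i, j] = exp(-alpha * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * d2)

K = gaussian_kernel(X, X)                         # n x n Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), y)  # a* = (K + lam I)^{-1} Y

X_test = np.array([[0.5]])
y_hat = gaussian_kernel(X_test, X) @ a            # sum_i a_i K(x_test, x^i)
print(y_hat, np.sin(1.0))                         # prediction close to sin(1.0)
```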
mapping high-dimensional data
minimising reconstruction error
Given $X \in \mathbb{R}^{d \times n}$, find $A \in \mathbb{R}^{q \times d}$ (and $B \in \mathbb{R}^{d \times q}$) that minimise the reconstruction error:
$$
\min\limits_{A,B} \sum_{i} \| x^i - B A x^i \|_2^2
$$
if $q=d$, then the error is zero.
Solution:
$B = A^T$
$\min\limits_{A} \sum_i \| x^i - A^T A x^i \|^2$ subject to $A A^T = I_{q \times q}$
assuming the data is centered, i.e. $\frac{1}{n} \sum_{i} x^i = \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix}^T$
eigenvalue decomposition
$$
\begin{aligned}
X^T X u &= \lambda u \\
X^T X &= U^T \Lambda U \\
\\
\because \Lambda &= \text{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_d \end{bmatrix}
\end{aligned}
$$
pca
Idea: given input $x^1, \cdots, x^n \in \mathbb{R}^d$, $\mu = \frac{1}{n} \sum_{i} x^i$
Thus
$$
C = \sum (x^i - \mu)(x^i - \mu)^T
$$
Find the eigenvectors/values of $C$:
$$
C = U^T \Lambda U
$$
Optimal $A$ stacks the eigenvectors of $C$ with the $q$ largest eigenvalues as rows:
$$
A = \begin{bmatrix}
u_1^T \\
u_2^T \\
\vdots \\
u_q^T
\end{bmatrix}
$$
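A compact numpy sketch of this PCA recipe (data and $q$ are arbitrary); note `np.linalg.eigh` returns eigenvalues in ascending order, so the top-$q$ eigenvectors are taken from the end:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, q = 500, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated data, rows = samples

mu = X.mean(axis=0)
Xc = X - mu                              # center the data
C = Xc.T @ Xc                            # covariance (up to a 1/n factor)

eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues, columns = eigenvectors
A = eigvecs[:, -q:][:, ::-1].T           # top-q eigenvectors as rows -> q x d

Z = Xc @ A.T                             # project: q-dimensional codes
X_rec = Z @ A + mu                       # reconstruct with B = A^T
print(np.mean((X - X_rec) ** 2))         # reconstruction error
```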
bayes rules and chain rules
Joint distribution: $P(X,Y)$
Conditional distribution of $X$ given $Y$: $P(X|Y) = \frac{P(X,Y)}{P(Y)}$
Bayes rule: $P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$
Chain rule:
for two events:
$$
P(A, B) = P(B \mid A)P(A)
$$
generalised:
$$
\begin{aligned}
&P(X_1, X_2, \ldots , X_k) \\
&= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1,\dots,X_{j-1}) \\[12pt]
&\because \text{expansion: }P(X_1)P(X_2|X_1)\ldots P(X_k|X_1,X_2,\ldots,X_{k-1})
\end{aligned}
$$
assume an underlying distribution $D$, and that the train and test sets are drawn from it independent and identically distributed (i.i.d.)
Example: flip a coin
Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$, i.e. $x \in \{0,1\}$; $x$ is a Bernoulli random variable.
$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$
a priori and posterior distribution
The maximum likelihood estimate would be
$$
\alpha^{\text{ML}} = \argmax_{\alpha} P(X | \alpha) = \argmin_{\alpha} - \sum_{i} \log (P(x^i | \alpha))
$$
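A quick numerical check of the Bernoulli MLE, which works out to the empirical frequency of $x=0$ (the grid search is only for illustration; names are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
true_alpha = 0.3                                       # P(x = 0)
X = (rng.uniform(size=1000) > true_alpha).astype(int)  # 0 w.p. alpha, 1 otherwise

def neg_log_likelihood(alpha: float, X: np.ndarray) -> float:
    # -sum_i log P(x^i | alpha) with P(0) = alpha, P(1) = 1 - alpha
    n0 = np.sum(X == 0)
    n1 = np.sum(X == 1)
    return -(n0 * np.log(alpha) + n1 * np.log(1 - alpha))

grid = np.linspace(0.01, 0.99, 99)
alpha_ml = grid[np.argmin([neg_log_likelihood(a, X) for a in grid])]
print(alpha_ml, np.mean(X == 0))                       # both near 0.3
```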
maximum a posteriori estimation
$$
\begin{aligned}
\alpha^{\text{MAP}} &= \argmax_{\alpha} P(\alpha | X) \\
&= \argmax_{\alpha} \frac{P(X|\alpha)P(\alpha)}{P(X)} \\
&= \argmin_{\alpha}\left[-\log P(\alpha) - \sum_{i=1}^{n} \log P(x^i | \alpha)\right]
\end{aligned}
$$
$$
\begin{aligned}
\argmax_{W} P(Z \mid W) P(W) &= \argmax_{W} \left[\log P(W) + \sum_{i} \log P(x^i, y^i \mid W)\right] \\
&= \argmax_{W} \left[\ln \frac{1}{\beta} - \lambda {\parallel W \parallel}_{2}^{2} - \sum_{i}\frac{({x^i}^T W - y^i)^2}{\sigma^2}\right]
\end{aligned}
$$
$P(W) = \frac{1}{\beta} e^{-\lambda \parallel W \parallel_{2}^{2}}$
$P(W) = \frac{1}{\beta} e^{-\frac{\lambda \parallel W \parallel_{2}^{2}}{r^2}}$
$$
\argmax_{W} P(Z \mid W) = \argmax_{W} \sum \log P(x^i, y^i \mid W)
$$
$$
P(y | x, W) = \frac{1}{\gamma} e^{-\frac{(x^T W-y)^2}{2 \sigma^2}}
$$
expected error minimisation
think of it as bias-variance tradeoff
Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$
solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y | X=x]$
Instead we have $Z = \{(x^i, y^i)\}^n_{i=1}$
error decomposition
$$
\begin{aligned}
&E_{x,y}(y-\hat{y_Z}(x))^2 \\
&= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y_Z}(x))^2 \\
&= \text{noise} + \text{estimation error}
\end{aligned}
$$
bias-variance decompositions
For linear estimator:
$$
\begin{aligned}
E_Z&E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Zx))^2 \\
=& E_{x,y}(y-y^{*}(x))^2 \quad \text{noise} \\
&+ E_x(y^{*}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{bias} \\
&+ E_xE_Z(\hat{y_Z}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{variance}
\end{aligned}
$$
See also: slides 13 , slides 14 , slides 15
accuracy
zero-one loss:
$$
l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}}= \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}
$$
linear classifier
$$
\begin{aligned}
\hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt]
&\because \hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W)
\end{aligned}
$$
surrogate loss functions
assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$
What if classifier's output is continuous?
$\hat{y}$ will also capture the "confidence" of the classifier.
Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.
linearly separable data
A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:
$$
\forall i \in [n] \mid \text{SGN}(\langle x^i, W^{*} \rangle) = y^i
$$
Or, for every $i \in [n]$ we have $(W^{* T}x^i)y^i > 0$
linear programming
$$
\begin{aligned}
\max_{W \in \mathbb{R}^d} &\langle{u, w} \rangle = \sum_{i=1}^{d} u_i w_i \\
&\text{s.t } A w \ge v
\end{aligned}
$$
Given that data is linearly separable
$$
\begin{aligned}
\exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i > 0 \\
\exists \space W^{*}, \gamma > 0 &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge \gamma \\
\exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge 1
\end{aligned}
$$
LP for linear classification
perceptron
Rosenblatt’s perceptron algorithm
Algorithm 5 Batch Perceptron
Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$
1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3: if $\exists \space i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
4: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
5: else
6: output $\mathbf{w}^{(t)}$
7: end if
8: end for
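A direct Python transcription of the listing above, assuming labels $y_i \in \{-1,+1\}$ and linearly separable data so the loop terminates (per the mistake bound below):

```python
import numpy as np

def batch_perceptron(X: np.ndarray, y: np.ndarray, max_iters: int = 10_000) -> np.ndarray:
    """X: m x d data, y: labels in {-1, +1}. Returns a separating w if one is found."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        margins = y * (X @ w)
        mistakes = np.where(margins <= 0)[0]
        if len(mistakes) == 0:
            return w                  # all points correctly classified
        i = mistakes[0]               # any violated example works
        w = w + y[i] * X[i]           # perceptron update
    return w

# toy separable data
rng = np.random.default_rng(8)
X = rng.standard_normal((100, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
w = batch_perceptron(X, y)
print(np.all(np.sign(X @ w) == y))    # True
```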
greedy update
$$
\begin{aligned}
W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i, x^i \rangle y^i \\
&=W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 y^{i} y^{i}
\end{aligned}
$$
proof
See also ( Novikoff, 1962 )
Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$ and $\exists \space \gamma > 0$ s.t.
$$
y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma
$$
Assumption: $\forall \space t= 1 \ldots n, \|\underline{x_t}\| \le R$
Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors
proof by induction
definition of $\underline{\theta^k}$
to be the parameter vector when the algorithm makes the $k^{\text{th}}$ error.
Note that we have $\underline{\theta^{1}}=\underline{0}$
Assume that the $k^{\text{th}}$ error is made on example $t$, or
$$
\begin{align}
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\
&= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\
&\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma \\[12pt]
&\because \text{ Assumption: } y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma
\end{align}
$$
It follows by induction on $k$ that
$$
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma
$$
Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$
$$
\begin{align}
\|\underline{\theta^{k+1}}\| &\ge k \gamma \\[16pt]
&\because \|\underline{\theta^{*}}\| = 1
\end{align}
$$
In the second part, we will find an upper bound for (5):
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\
&= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\
&\le \|\underline{\theta^k}\|^2 + R^2
\end{align}
$$
(9) is due to:
$y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (and $y_t^2 = 1$)
$y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$ given that the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example.
It follows by induction on $k$ that
$$
\begin{align}
\|\underline{\theta^{k+1}}\|^2 \le kR^2
\end{align}
$$
Combining (5) and (10) gives us
$$
\begin{aligned}
k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\
k &\le \frac{R^2}{\gamma^2}
\end{aligned}
$$
linear algebra review.
Diagonal matrix: every entry except the diagonal is zero.
$$
A = \begin{bmatrix} a_{1} & 0 & \cdots & 0 \\
0 & a_{2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_{n} \end{bmatrix}
$$
trace: sum of the entries on the main diagonal: $\text{tr}(A) = \sum_{i=1}^{n} a_{ii}$
Properties of transpose:
$$
\begin{aligned}
(A^T)^T &= A \\
(A + B)^T &= A^T + B^T \\
(AB)^T &= B^T A^T
\end{aligned}
$$
Properties of inverse:
$$
\begin{aligned}
(A^{-1})^{-1} &= A \\
(AB)^{-1} &= B^{-1} A^{-1} \\
(A^T)^{-1} &= (A^{-1})^T
\end{aligned}
$$
if the matrix $A^{-1}$ exists, then $A$ is invertible (non-singular), and vice versa.
Given a square matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is defined as $x^TAx \in \mathbb{R}$:
$$
x^TAx = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j
$$
norms
A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is a norm if it satisfies the following properties:
non-negativity: $\forall x \in \mathbb{R}^n, f(x) \geq 0$
definiteness: $f(x) = 0 \iff x=0$
homogeneity: $\forall x \in \mathbb{R}^n, t\in \mathbb{R}, f(tx) = |t| f(x)$
triangle inequality: $\forall x, y \in \mathbb{R}^n, f(x+y) \leq f(x) + f(y)$
symmetry
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$ ($A \in \mathbb{S}^n$)
anti-symmetric (skew-symmetric) if $A = -A^T$
Given any square matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric, and $A - A^T$ is anti-symmetric.
$$
A = \frac{1}{2}(A+A^T) + \frac{1}{2}(A-A^T)
$$
$A$ is positive definite if $x^TAx > 0$ for all non-zero $x \in \mathbb{R}^n$.
It is denoted by $A \succ 0$.
The set of all positive definite matrices is denoted by $\mathbb{S}^n_{++}$
$A$ is positive semi-definite if $x^TAx \geq 0 \ \forall x \in \mathbb{R}^n$.
It is denoted by $A \succeq 0$.
The set of all positive semi-definite matrices is denoted by $\mathbb{S}^n_{+}$
$A$ is negative definite if $x^TAx < 0$ for all non-zero $x \in \mathbb{R}^n$.
It is denoted by $A \prec 0$.
The set of all negative definite matrices is denoted by $\mathbb{S}^n_{--}$
$A$ is negative semi-definite if $x^TAx \leq 0 \ \forall x \in \mathbb{R}^n$.
It is denoted by $A \preceq 0$.
The set of all negative semi-definite matrices is denoted by $\mathbb{S}^n_{-}$
A symmetric matrix $A \in \mathbb{S}^n$ is indefinite if it is neither positive semi-definite nor negative semi-definite:
$$
\exists x_1, x_2 \in \mathbb{R}^n \space \mid \space x_1^TAx_1 > 0 \space \text{and} \space x_2^TAx_2 < 0
$$
Given any matrix $A \in \mathbb{R}^{m \times n}$, the matrix $G = A^TA$ is always positive semi-definite (known as the Gram matrix)
Proof: $x^TGx = x^TA^TAx = (Ax)^T(Ax) = \|Ax\|_2^2 \geq 0$
eigenvalues and eigenvectors
The non-zero vector $x \in \mathbb{C}^n$ is an eigenvector of $A$ and $\lambda \in \mathbb{C}$ is called the eigenvalue of $A$ if:
$$
Ax = \lambda x
$$
$$
\begin{aligned}
\exists \text{ non-zero eigenvector } x \in \mathbb{C}^n & \iff \text{ null space of } (A - \lambda I) \text{ is non-trivial} \\
& \iff (A - \lambda I) \text{ is singular} \\
& \iff \mid A - \lambda I \mid = 0
\end{aligned}
$$
Solving for the eigenvectors via $(A-\lambda_{i}I)x_i=0$
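A quick numpy check of these definitions (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

# verify A x = lambda x for each pair, and that det(A - lambda I) = 0
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x),
          np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))
```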
See also matrix cookbook
matrix representation of a system of linear equations
$$
\begin{aligned}
x_1 + x_2 + x_3 &= 5 \\
x_1 - 2x_2 - 3x_3 &= -1 \\
2x_1 + x_2 - x_3 &= 3
\end{aligned}
$$
Equivalent matrix representation $Ax = b$:
$$
\begin{aligned}
A &= \begin{bmatrix}
1 & 1 & 1 \\
1 & -2 & -3 \\
2 & 1 & -1
\end{bmatrix} \\
x &= \begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix} \\
b &= \begin{bmatrix}
5 \\
-1 \\
3
\end{bmatrix}
\end{aligned}
$$
where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$
$A \in \mathbb{R}^{m \times n}$ and $A^T \in \mathbb{R}^{n \times m}$
dot product.
$$
\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i
$$
linear combination of columns
Let $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$; then $Ax \in \mathbb{R}^m$
Then $Ax = \sum_{i=1}^{n} a_i x_i \in \mathbb{R}^m$, where $a_i$ is the $i$-th column of $A$
inverse of a matrix
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is a unique matrix denoted by $A^{-1} \in \mathbb{R}^{n\times{n}}$
$$
A^{-1} A = I = A A^{-1}
$$
euclidean norm
$L_{2}$ norm:
$$
\| x \|_{2} = \sqrt{\sum_{i=1}^{n}{x_i^2}} = \sqrt{x^Tx}
$$
L1 norm: $\| x \|_{1} = \sum_{i=1}^{n}{|x_i|}$
$L_{\infty}$ norm: $\| x \|_{\infty} = \max_{i}{|x_i|}$
p-norm: $\| x \|_{p} = (\sum_{i=1}^{n}{|x_i|^p})^{1/p}$
$\|x\|_{\infty} \leq \|x\|_{2} \leq \|x\|_{1}$
One can prove this with Cauchy-Schwarz inequality
linear dependence of vectors
Given $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$ and $\alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}$:
the set is linearly independent if no vector can be written as a linear combination of the others, i.e.
$$
\forall i \in [n], \forall \{a_j\}_{j \neq i} \subseteq \mathbb{R} \quad x_i \neq \sum_{j \neq i}{a_j x_j}
$$
Span
Given a set of vectors $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$, the span of the set is the set of all possible linear combinations of the vectors.
$$
\text{span}(\{x_1, x_2, \ldots, x_n\}) = \{ y: y = \sum_{i=1}^{n}{\alpha_i x_i} \mid \alpha_i \in \mathbb{R} \}
$$
If $x_{1}, x_{2}, \ldots, x_{n}$ are $n = d$ linearly independent vectors, then their span is the entire space $\mathbb{R}^d$
Rank
For a matrix $A \in \mathbb{R}^{m \times n}$:
column rank: max number of linearly independent columns of $A$
row rank: max number of linearly independent rows of $A$
If $\text{rank}(A) = m$, then the rows are linearly independent. If $\text{rank}(A) = n$, then the columns are linearly independent.
the rank of a matrix $A$ is the number of linearly independent columns of $A$:
if $A$ is full rank, then $\text{rank}(A) = \min(m, n)$ (in general $\text{rank}(A) \leq \min(m, n)$)
$\text{rank}(A) = \text{rank}(A^T)$
solving linear system of equations
If $A \in \mathbb{R}^{n \times n}$ is invertible, there exists a unique solution:
$$
x = A^{-1}b
$$
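Numerically one usually solves the system directly instead of forming $A^{-1}$; a sketch using the example system above:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, -2.0, -3.0],
              [2.0, 1.0, -1.0]])
b = np.array([5.0, -1.0, 3.0])

x = np.linalg.solve(A, b)         # preferred over np.linalg.inv(A) @ b
print(x, np.allclose(A @ x, b))   # solution satisfies Ax = b
```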
Range and Projection
Given a matrix $A \in \mathbb{R}^{m \times n}$, the range of $A$, denoted by $\mathcal{R}(A)$, is the span of the columns of $A$:
$$
\mathcal{R}(A) = \{ y \in \mathbb{R}^m \mid y = Ax, \; x \in \mathbb{R}^n \}
$$
Projection of a vector $y \in \mathbb{R}^m$ onto $\text{span}(\{x_1, \cdots, x_n\})$, $x_i \in \mathbb{R}^m$, is the vector in the span that is as close as possible to $y$ w.r.t. the $\ell_2$ norm
$$
\text{Proj}(y; \{x_{1}, \cdots, x_n\}) = \argmin_{{v \in \text{span}(\{x_1, \cdots, x_n\})}} \| y - v \|_2
$$
Null space of A A A
is the set of all vectors that satisfy the following:
$$
\mathcal{N}(A) = \{ x \in \mathbb{R}^n \mid Ax = 0 \}
$$
probability theory
With Bayes rules we have
$$
P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}
$$
Chain rule states for events $A_1, \ldots, A_n$:
$$
\begin{aligned}
P(A_1 \cap A_2 \cap \ldots \cap A_n) &= P(A_n|A_{n-1} \cap \ldots \cap A_1)P(A_{n-1} \cap \ldots \cap A_1) \\
&= P(A_1) \prod_{i=2}^{n} P(A_i|\cap_{j=1}^{i-1} A_j)
\end{aligned}
$$
If $B_{1}, \ldots , B_{n}$ form a finite partition of the sample space, i.e. $\forall i \neq j, B_i \cap B_j = \emptyset \land \cup_{i=1}^{n} B_i = \Omega$, then the law of total probability states that for an event $A$:
$$
P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)
$$
cumulative distribution function
For a random variable $X$, a CDF $F_X(x): \mathbb{R} \rightarrow [0,1]$ is defined as:
$$
F_X(x) \coloneqq P(X \leq x)
$$
$0 \leq F_X(x) \leq 1$
$P(a \leq X \leq b) = F_X(b) - F_X(a)$
probability mass function
for a discrete random variable $X$, the probability mass function $p_X(x) : \mathbb{R} \rightarrow [0,1]$ is defined as:
$$
p_X(x) \coloneqq P(X=x)
$$
$0 \leq p_X(x) \leq 1$
$\sum_{x \in \mathbb{D}} p_X(x) = 1$, where $\mathbb{D}$ is the set of all possible values of $X$
$P(X \in A) = P(\{\omega: X(\omega) \in A\}) = \sum_{x \in A} p_X(x)$
probability density function
for a continuous random variable $X$, the probability density function $f_X(x) : \mathbb{R} \rightarrow [0, \infty)$ is defined as:
$$
f_X(x) \coloneqq \frac{d F_X(x)}{dx}
$$
$f_X(x) \geq 0$
$F_X(x) = \int_{-\infty}^{x}f_X(t)\,dt$
Expectation
for a discrete random variable with PMF $p_X(x)$ and $g(x): \mathbb{R} \rightarrow \mathbb{R}$, the expectation of $g(x)$ is:
$$
\mathbb{E}[g(X)] = \sum_{x \in \mathbb{D}} g(x) p_X(x)
$$
for a continuous random variable with PDF $f_X(x)$, the expectation of $g(x)$ is:
$$
\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx
$$
Therefore, the mean of a random variable $X$ is $\mathbb{E}[X]$:
$$
\mu = \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) dx
$$
Variance of a random variable $X$ is:
$$
\sigma^2 = \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2
$$
$\text{Var}(f(X)+c)=\text{Var}(f(X))$
$\text{Var}(cf(X)) = c^2 \text{Var}(f(X))$
discrete random variables
Bernoulli distribution: $X \sim \text{Bernoulli}(p), 0 \le p \le 1$
$$
\begin{aligned}
p_X(x) &= \begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases} \\
\\
\mathbb{E}[X] &= p \\
\text{Var}(X) &= p(1-p)
\end{aligned}
$$
Binomial distribution: $X \sim \text{Binomial}(n,p), 0 \le p \le 1$
$$
\begin{aligned}
p_X(x) &= \binom{n}{x} p^x (1-p)^{n-x} \\
\\
\because \binom{n}{x} &= \frac{n!}{x!(n-x)!} \\
\mathbb{E}[X] &= np \\
\text{Var}(X) &= np(1-p)
\end{aligned}
$$
Poisson distribution: $X \sim \text{Poisson}(\lambda), \lambda > 0$
$$
\begin{aligned}
p_X(x) &= \frac{e^{-\lambda} \lambda^x}{x!} \\
\mathbb{E}[X] &= \lambda \\
\text{Var}(X) &= \lambda
\end{aligned}
$$
continuous random variables
Uniform distribution: $X \sim \text{Unif}(a,b), a \le b$
$$
\begin{aligned}
f_X(x) &= \begin{cases} \frac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \\
\\
\mathbb{E}[X] &= \frac{a+b}{2} \\
\text{Var}(X) &= \frac{(b-a)^2}{12}
\end{aligned}
$$
Exponential distribution: $X \sim \text{Exp}(\lambda), \lambda > 0$
$$
\begin{aligned}
f_X(x) &= \lambda e^{-\lambda x} \\
\\
\mathbb{E}[X] &= \frac{1}{\lambda} \\
\text{Var}(X) &= \frac{1}{\lambda^2}
\end{aligned}
$$
Gaussian distribution: $X \sim \mathcal{N}(\mu, \sigma^2), -\infty < \mu < \infty, \sigma^2 > 0$
$$
\begin{aligned}
f_X(x) &= \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\\
\mathbb{E}[X] &= \mu \\
\text{Var}(X) &= \sigma^2
\end{aligned}
$$