
See also: book

probability density function

if $X$ is a random variable, the probability density function (pdf) is a function $f(x)$ such that:

$$P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$$

if the distribution of $X$ is uniform over $[a,b]$, then $f(x) = \frac{1}{b-a}$

curve fitting

how do we fit a curve to a distribution of data?

Given a set of $n$ data points $S=\{(x^i, y^i)\}^{n}_{i=1}$

  • $x \in \mathbb{R}^{d}$
  • $y \in \mathbb{R}$ (or $\mathbb{R}^{k}$)

In the case of 1-D ordinary least squares, the problem amounts to finding $a,b \in \mathbb{R}$ that minimize $\min\limits_{a,b} \sum_{i=1}^{n} (ax^i + b - y^i)^2$


minimize $f(a, b) = \sum^{n}_{i=1}{(ax^i + b - y^i)^2}$

$$\begin{aligned} \frac{\partial f}{\partial a} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} x^{i} = 0 \\ \frac{\partial f}{\partial b} &= 2 \sum^{n}_{i=1}{(ax^i + b - y^i)} = 0 \\ \implies 2nb + 2a \sum_{i=1}^{n} x^i &= 2 \sum_{i=1}^{n} y^i \\ \implies b + a \overline{x} &= \overline{y} \\ \implies b &= \overline{y} - a \overline{x} \\ \because \overline{y} &= \frac{1}{n} \sum_{i=1}^{n} y^{i}, \quad \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x^{i} \end{aligned}$$

optimal solution

$$\begin{aligned} a &= \frac{\overline{xy} - \overline{x} \cdot \overline{y}}{\overline{x^2} - (\overline{x})^2} = \frac{\text{COV}(x,y)}{\text{Var}(x)} \\ b &= \overline{y} - a \overline{x} \end{aligned}$$

where $\overline{x} = \frac{1}{n} \sum{x^i}$, $\overline{y} = \frac{1}{n} \sum{y^i}$, $\overline{xy} = \frac{1}{n} \sum{x^i y^i}$, $\overline{x^2} = \frac{1}{n} \sum{(x^i)^2}$
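A quick numeric sketch of this closed-form solution (the synthetic data, true slope 2 and intercept 1 below are illustrative assumptions, not from the notes):

```python
import numpy as np

# Synthetic 1-D data: y ≈ 2x + 1 plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x * x).mean()

a = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar**2)  # COV(x, y) / Var(x)
b = y_bar - a * x_bar

print(a, b)  # close to the true slope 2 and intercept 1
```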


hyperplane

Hyperplane equation

$$\begin{aligned} \hat{y} &= w_{0} + \sum_{j=1}^{d}{w_j x_j}\\[12pt] &\because w_0: \text{the y-intercept (bias)} \end{aligned}$$

Homogeneous hyperplane:

$$\begin{aligned} w_{0} & = 0 \\ \hat{y} &= \sum_{j=1}^{d}{w_j x_j} = \langle{w,x} \rangle = w^Tx \end{aligned}$$

Matrix form OLS:

$$X_{n\times d} = \begin{pmatrix} x_1^1 & \cdots & x_d^1 \\ \vdots & \ddots & \vdots \\ x_1^n & \cdots & x_d^n \end{pmatrix},\quad Y_{n\times 1} = \begin{pmatrix} y^1 \\ \vdots \\ y^n \end{pmatrix},\quad W_{d\times 1} = \begin{pmatrix} w_1 \\ \vdots \\ w_d \end{pmatrix}$$

$$\begin{aligned} \text{Obj} &: \sum_{i=1}^n (\hat{y}^i - y^i)^2 = \sum_{i=1}^n (\langle w, x^i \rangle - y^i)^2 \\ \text{Def} &: \Delta = \begin{pmatrix} \Delta_1 \\ \vdots \\ \Delta_n \end{pmatrix} = \begin{pmatrix} x_1^1 & \cdots & x_d^1 \\ \vdots & \ddots & \vdots \\ x_1^n & \cdots & x_d^n \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_d \end{pmatrix} - \begin{pmatrix} y^1 \\ \vdots \\ y^n \end{pmatrix} = \begin{pmatrix} \hat{y}^1 - y^1 \\ \vdots \\ \hat{y}^n - y^n \end{pmatrix} \end{aligned}$$

minimize over $W$:

$$\min\limits_{W \in \mathbb{R}^{d \times 1}} \|XW - Y\|_2^2$$

OLS solution

$$W^{\text{LS}} = (X^T X)^{-1}X^T Y$$

Example:

$$\hat{y} = w_{0} + w_{1} \cdot x_{1} + w_{2} \cdot x_{2}$$

With

$$X_{n \times 2} = \begin{pmatrix} x^{1}_{1} & x^{1}_{2} \\ x^{2}_{1} & x^{2}_{2} \\ x^{3}_{1} & x^{3}_{2} \end{pmatrix}$$

and

$$X^{'}_{n \times 3} = \begin{pmatrix} x^{1}_{1} & x^{1}_{2} & 1 \\ x^{2}_{1} & x^{2}_{2} & 1 \\ x^{3}_{1} & x^{3}_{2} & 1 \end{pmatrix}$$

With

$$W = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

and

$$W^{'} = \begin{pmatrix} w_1 \\ w_2 \\ w_0 \end{pmatrix}$$

thus

$$X^{'} W^{'} = \begin{pmatrix} w_0 + \sum{w_i \times x_i^{1}} \\ \vdots \\ w_0 + \sum{w_i \times x_i^{n}} \end{pmatrix}$$

adding bias in D-dimensions OLS

$$X^{'}_{n \times (d+1)} = \begin{pmatrix} x_1^{1} & \cdots & x_d^{1} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_1^{n} & \cdots & x_d^{n} & 1 \end{pmatrix}$$

and

$$W_{(d+1) \times 1} = \begin{pmatrix} w_1 \\ \vdots \\ w_d \\ w_0 \end{pmatrix}$$

Add a new auxiliary dimension to the input data, $x_{d+1} = 1$.

Solve OLS:

$$\min\limits_{W \in \mathbb{R}^{(d+1) \times 1}} \|X^{'}W - Y\|_2^2$$

Gradient for $f: \mathbb{R}^d \rightarrow \mathbb{R}$:

$$\triangledown_{w} \space f = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix}$$

Jacobian for $g: \mathbb{R}^m \rightarrow \mathbb{R}^n$:

$$\begin{aligned} \triangledown_{w} \space g &= \begin{bmatrix} \frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_1}{\partial w_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_n}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_m} \end{bmatrix}_{n \times m} \\ \\ &u, v \in \mathbb{R}^d \\ &\because g(u) = u^T v \implies \triangledown_{u} \space g = v \text{ (gradient)} \\ \\ &A \in \mathbb{R}^{n \times n};\ u \in \mathbb{R}^n \\ &\because g(u) = u^T A u \implies \triangledown_{u} \space g = (A + A^T) u \end{aligned}$$

result

$$W^{\text{LS}} = (X^T X)^{-1} X^T Y$$
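A small numpy sketch of the full pipeline, appending the auxiliary all-ones column for the bias and then applying the normal equation (the synthetic data and true weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
Y = X @ true_w + true_b + rng.normal(scale=0.1, size=n)

# Add the auxiliary dimension x_{d+1} = 1 so the bias w_0 is absorbed into W.
X_aug = np.hstack([X, np.ones((n, 1))])

# Normal equation W_LS = (X^T X)^{-1} X^T Y, solved without an explicit inverse.
W_ls = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)

print(W_ls)  # approximately [1.5, -2.0, 0.5, 0.7]; the last entry is the bias w_0
```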

overfitting.

strategies to avoid:

  • add more training data
  • L1 (Lasso) or L2 (Ridge) regularization
    • add a penalty term to the objective function
    • L1 yields sparse models: the absolute-value penalty on the weights forces some coefficients to become exactly 0, and it is more robust to outliers. $\text{Loss}(w) = \text{Error} + \lambda \times \| w \|_{1}$
    • L2 does not perform feature selection: weights are only shrunk toward 0 rather than set exactly to 0 as with L1, so every feature stays in the model. $\text{Loss}(w) = \text{Error} + \lambda \times \| w \|_{2}^{2}$
  • Cross-validation
    • split data into k-fold
  • early stopping
  • dropout, see example
    • randomly selected neurons are ignored during training, which makes the network less sensitive to the weights of any individual neuron

sample complexity of learning multivariate polynomials


regularization.

L2 regularization:

$$\min_{W \in \mathbb{R}^{d}} \| XW - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}$$

Solving $W^{\text{RLS}}$

Setting the gradient to zero and solving gives:

$$W^{\text{RLS}} = (X^T X + \lambda I)^{-1} X^T Y$$

The inverse exists as long as $\lambda > 0$, since $X^TX + \lambda I$ is then positive definite.
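A minimal sketch of the regularised solution ($\lambda = 0.1$ and the toy data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 10
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

lam = 0.1
# W_RLS = (X^T X + λI)^{-1} X^T Y; X^T X + λI is invertible for λ > 0.
W_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(W_rls)
```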


polynomial curve-fitting revisited

feature map: $\phi(x): \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{d_2}$ where $d_2 \gg d_1$

training:

  • $W^{*} = \min\limits_{W} \| \phi W - Y \|^{2}_{2} + \lambda \| W \|_{2}^{2}$
  • $W^{*} = (\phi^T \phi + \lambda I)^{-1} \phi^T Y$

prediction:

  • $\hat{y} = \langle{W^{*}, \phi{(x)}} \rangle = {W^{*}}^T \phi(x)$

choices of $\phi(x)$

  • Gaussian basis functions: $\phi(x) = \exp{(-\frac{\| x - \mu \|^{2}}{2\sigma^{2}})}$
  • Polynomial basis functions: $\phi(x) = \{1, x, x^{2}, \ldots, x^{d}\}$ (see the sketch after this list)
  • Fourier basis functions: DFT, FFT
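A short sketch of polynomial-basis ridge regression in 1-D (the degree, $\lambda$, and sine target below are illustrative assumptions):

```python
import numpy as np

def poly_features(x, degree):
    """Feature map phi(x) = (1, x, x^2, ..., x^degree) for a 1-D input array."""
    return np.vstack([x**k for k in range(degree + 1)]).T

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=30)

Phi = poly_features(x, degree=5)
lam = 1e-3
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Prediction: y_hat = <W*, phi(x)> at new points.
x_new = np.linspace(-1, 1, 5)
print(poly_features(x_new, degree=5) @ W)
```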

kernels

compute inner products in a higher-dimensional feature space without explicitly forming $\phi(x)$

$$K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$$

Polynomial kernels of degree 2:

$$k(x^i, x^j) = (1 + (x^i)^T x^j)^2 = (1 + \langle{x^i, x^j} \rangle)^2 \qquad \because O(d) \text{ operations}$$

degree-$M$ polynomial:

$$k(x^i, x^j) = (1 + (x^i)^T x^j)^M$$

How many operations?

  • improved: $d + \log M$ ops (compute the inner product in $O(d)$, then raise $(1 + \langle x^i, x^j \rangle)$ to the $M$-th power by repeated squaring in $O(\log M)$; see the sketch below)
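A quick check that the degree-2 kernel matches an explicit feature map's inner product while touching only $O(d)$ numbers; the particular feature-map construction below (with the $\sqrt{2}$ scalings) is a standard choice and an assumption not spelled out in the notes:

```python
import numpy as np
from itertools import combinations

def phi_degree2(x):
    """Explicit feature map whose inner product equals (1 + x.z)^2 (O(d^2) features)."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x**2, cross])

rng = np.random.default_rng(4)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi_degree2(x) @ phi_degree2(z)  # inner product in feature space
kernel = (1 + x @ z) ** 2                   # O(d) operations, no feature map needed
print(np.isclose(explicit, kernel))         # True
```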

kernel least squares

Steps:

  • $W^{*} = \min\limits_{W} \|\phi W - Y\|_2^2 + \lambda \| W \|_2^2$
  • show that $\exists \space a \in \mathbb{R}^n \mid W^{*} = \phi^T a$, or $W^{*} = \sum a_i \phi(x^i)$
  • use $W^{*} = \sum a_i \phi(x^i)$ to form the dual representation of the problem:

$$\min\limits_{a \in \mathbb{R}^n} \| Ka - Y \|_2^2 + \lambda a^T K a \qquad \because \hat{Y} = \phi \phi^T a = K_{n \times n}\, a_{n \times 1}$$

Solution:

$$a^{*} = (K + \lambda I)^{-1} Y$$

choices

  • polynomial kernel: $K(x, z) = (1 + x^T z)^d$
  • Gaussian kernel: $K(x, z) = e^{-\frac{\|x-z\|_2^2}{2\sigma^2}} = e^{-\alpha \|x-z\|^2_2}$
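A minimal kernel least squares sketch with the Gaussian kernel ($\alpha$, $\lambda$, and the toy sine data are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, alpha=1.0):
    """K[i, j] = exp(-alpha * ||A_i - B_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-alpha * sq_dists)

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=40)

lam = 1e-2
K = gaussian_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), y)  # a* = (K + λI)^{-1} Y

# Prediction at new points: y_hat(x) = sum_i a_i K(x, x^i)
X_new = np.linspace(-2, 2, 5).reshape(-1, 1)
print(gaussian_kernel(X_new, X) @ a)
```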

mapping high-dimensional data

minimising reconstruction error

  • Given $X \in \mathbb{R}^{d \times n}$, find $A \in \mathbb{R}^{q \times d}$ (with decoder $B \in \mathbb{R}^{d \times q}$) that minimises the reconstruction error: $\min\limits_{A,B} \sum_{i} \| x^i - B A x^i \|_2^2$

if $q=d$, then the error is zero.

Solution:

  • $B = A^T$
  • $\min\limits_{A} \sum_i \| x^i - A^T A x^i \|^2$ subject to $A A^T = I_{q \times q}$
  • assume the data is centered, i.e. $\frac{1}{n} \sum_{i} x^i = \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix}^T$

eigenvalue decomposition

$$\begin{aligned} X^T X u &= \lambda u \\ X^T X &= U^T \Lambda U \\ \because \Lambda &= \text{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix} \end{aligned}$$

pca

Idea: given input $x^1, \cdots, x^n \in \mathbb{R}^d$, $\mu = \frac{1}{n} \sum_{i} x^i$

Thus

$$C = \sum_{i} (x^i - \mu)(x^i - \mu)^T$$

Find the eigenvectors/values of CC:

$$C = U^T \Lambda U$$

Optimal AA is:

$$A = \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_q^T \end{bmatrix}$$

where $u_1, \ldots, u_q$ are the eigenvectors of $C$ with the $q$ largest eigenvalues.
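A small numpy sketch of these steps (the toy data and the choice $q = 2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))            # n = 100 samples, d = 5 features

mu = X.mean(axis=0)                      # mu = (1/n) sum_i x^i
C = (X - mu).T @ (X - mu)                # C = sum_i (x^i - mu)(x^i - mu)^T

eigvals, eigvecs = np.linalg.eigh(C)     # eigh since C is symmetric
order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order

q = 2
A = eigvecs[:, order[:q]].T              # rows are u_1^T, ..., u_q^T
X_proj = (X - mu) @ A.T                  # q-dimensional representation
print(X_proj.shape)                      # (100, 2)
```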

bayes rules and chain rules

Joint distribution: $P(X,Y)$

Conditional distribution of $X$ given $Y$: $P(X|Y) = \frac{P(X,Y)}{P(Y)}$

Bayes rule: $P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$

Chain rule:

for two events:

$$P(A, B) = P(B \mid A)P(A)$$

generalised:

$$\begin{aligned} &P(X_1, X_2, \ldots , X_k) \\ &= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1,\dots,X_{j-1}) \\[12pt] &\because \text{expansion: }P(X_1)P(X_2|X_1)\ldots P(X_k|X_1,X_2,\ldots,X_{k-1}) \end{aligned}$$

i.i.d assumption

assume an underlying distribution $D$ such that the train and test sets are independent and identically distributed (i.i.d.)

Example: flip a coin

Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$; that is, $x \in \{0,1\}$ and $x$ is a Bernoulli random variable.

$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$

a priori and posterior distribution

The maximum likelihood estimate would be:

$$\alpha^{\text{ML}} = \argmax_{\alpha} P(X \mid \alpha) = \argmin_{\alpha} - \sum_{i} \log (P(x^i \mid \alpha))$$
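For the coin example, writing $n_0$ and $n_1$ for the number of observed 0s and 1s (notation introduced here just for this worked step), the minimisation has a closed form:

$$\begin{aligned} -\sum_{i} \log P(x^i \mid \alpha) &= -\bigl(n_0 \log \alpha + n_1 \log (1-\alpha)\bigr) \\ \frac{d}{d\alpha}\Bigl[-\bigl(n_0 \log \alpha + n_1 \log(1-\alpha)\bigr)\Bigr] = -\frac{n_0}{\alpha} + \frac{n_1}{1-\alpha} &= 0 \\ \implies \alpha^{\text{ML}} &= \frac{n_0}{n_0 + n_1} = \frac{n_0}{n} \end{aligned}$$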

maximum a posteriori estimation

$$\begin{aligned} \alpha^{\text{MAP}} &= \argmax_{\alpha} P(\alpha \mid X) \\ &= \argmax_{\alpha} \frac{P(X\mid\alpha)P(\alpha)}{P(X)} \\ &= \argmin_{\alpha}(-\log P(\alpha)) - \sum_{i=1}^{n} \log P(x^i \mid \alpha) \end{aligned}$$

For regularised least squares, take the parameter to be $W$ with prior $P(W) = \frac{1}{\beta} e^{-\lambda \parallel W \parallel_{2}^{2}}$:

$$\begin{aligned} \argmax_{W} P(Z \mid W) P(W) &= \argmax_{W} \Bigl[\log P(W) + \sum_{i} \log P(x^i, y^i \mid W)\Bigr] \\ &= \argmax_{W} \Bigl[\ln \frac{1}{\beta} - \lambda {\parallel W \parallel}_{2}^{2} - \sum_{i} \frac{({x^i}^T W - y^i)^2}{\sigma^2}\Bigr] \end{aligned}$$

What if we have

$$P(W) = \frac{1}{\beta} e^{-\frac{\lambda \parallel W \parallel_{2}^{2}}{r^2}}$$

$$\argmax_{W} P(Z \mid W) = \argmax_{W} \sum_{i} \log P(x^i, y^i \mid W) \qquad \because P(y \mid x, W) = \frac{1}{\gamma} e^{-\frac{(x^T W-y)^2}{2 \sigma^2}}$$

expected error minimisation

think of it as bias-variance tradeoff

Squared loss: $l(\hat{y},y)=(y-\hat{y})^2$

solution to $y^* = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $E[Y \mid X=x]$

Instead we have $Z = \{(x^i, y^i)\}^n_{i=1}$

error decomposition

$$\begin{aligned} &E_{x,y}(y-\hat{y_Z}(x))^2 \\ &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y_Z}(x))^2 \\ &= \text{noise} + \text{estimation error} \end{aligned}$$

bias-variance decompositions

For linear estimator:

$$\begin{aligned} E_Z&E_{x,y}(y-(\hat{y}_Z(x)\coloneqq W^T_Zx))^2 \\ =& E_{x,y}(y-y^{*}(x))^2 \quad \text{noise} \\ &+ E_x(y^{*}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{bias} \\ &+ E_xE_Z(\hat{y_Z}(x) - E_Z(\hat{y_Z}(x)))^2 \quad \text{variance} \end{aligned}$$

nearest neighbour

See also: slides 13, slides 14, slides 15


accuracy

zero-one loss:

$$l^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}}= \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}$$

linear classifier

$$\begin{aligned} \hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\[8pt] &\because \hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W) \end{aligned}$$

surrogate loss functions

assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$

What if classifier's output is continuous?

$\hat{y}$ will also capture the "confidence" of the classifier.

Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.

linearly separable data

linearly separable

A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

  • $\forall i \in [n] \mid \text{SGN}(\langle x^i, W^{*} \rangle) = y^i$
  • Or, for every $i \in [n]$ we have $({W^{*}}^T x^i)y^i > 0$

linear programming

$$\begin{aligned} \max_{W \in \mathbb{R}^d} &\langle{u, w} \rangle = \sum_{i=1}^{d} u_i w_i \\ &\text{s.t. } A w \ge v \end{aligned}$$

Given that data is linearly separable

$$\begin{aligned} \exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i > 0 \\ \exists \space W^{*}, \gamma > 0 &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge \gamma \\ \exists \space W^{*} &\mid \forall i \in [n], ({W^{*}}^T x^i)y^i \ge 1 \end{aligned}$$

LP for linear classification

  • Define $A = [x_j^iy^i]_{n \times d}$

  • finding the optimal $W$ is equivalent to

$$\begin{aligned} \max_{w \in \mathbb{R}^d} &\langle{\vec{0}, w} \rangle \\ & \text{s.t. } Aw \ge \vec{1} \end{aligned}$$

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 5: Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$

2: for $t = 1,2,\ldots$ do

3: if $(\exists \space i \text{ s.t. } y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0)$ then

4: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$

5: else

6: output $\mathbf{w}^{(t)}$

7: end if

8: end for
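A runnable sketch of this algorithm (the toy linearly separable data and the `max_iters` cap are illustrative assumptions):

```python
import numpy as np

def batch_perceptron(X, y, max_iters=1000):
    """Rosenblatt's batch perceptron: update on a misclassified point until none remain."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mistakes = np.flatnonzero(y * (X @ w) <= 0)  # indices with y_i <w, x_i> <= 0
        if mistakes.size == 0:
            return w                                 # every point classified correctly
        i = mistakes[0]
        w = w + y[i] * X[i]                          # w^{(t+1)} = w^{(t)} + y_i x_i
    return w

# Toy linearly separable data with labels in {-1, +1}.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(50, 2)) + 2.0, rng.normal(size=(50, 2)) - 2.0])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = batch_perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # True once a separating w is found
```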

greedy update

$$\begin{aligned} W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}}+ y^i x^i, x^i \rangle y^i \\ &=W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2 y^{i} y^{i} \end{aligned}$$

proof

See also (Novikoff, 1962)

Theorem

Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$ and $\exists \space \gamma > 0$ such that

$$y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma$$

Assumption: $\forall \space t = 1 \ldots n, \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.

proof by induction

definition of $\underline{\theta^k}$

to be the parameter vector when the algorithm makes its $k^{\text{th}}$ error.

Note that we have $\underline{\theta^{1}}=\underline{0}$

Assume that the $k^{\text{th}}$ error is made on example $t$. Then:

$$\begin{align} \underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\ &= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\ &\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma \\[12pt] &\because \text{ Assumption: } y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma \end{align}$$

It follows by induction on $k$ that

$$\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma$$

Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$, hence

$$\begin{align} \|\underline{\theta^{k+1}}\| &\ge k \gamma \\[16pt] &\because \|\underline{\theta^{*}}\| = 1 \end{align}$$

In the second part, we find an upper bound for (5):

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\ &= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\ &\le \|\underline{\theta^k}\|^2 + R^2 \end{align}$$

(9) is due to:

  • $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$ by the assumption of the theorem (and $y_t^2 = 1$)
  • $y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$, since the parameter vector $\underline{\theta^k}$ gave an error on the $t^{\text{th}}$ example.

It follows by induction on $k$ that

$$\begin{align} \|\underline{\theta^{k+1}}\|^2 \le kR^2 \end{align}$$

Combining (5) and (10) gives us

$$\begin{aligned} k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\ k &\le \frac{R^2}{\gamma^2} \end{aligned}$$



linear algebra review.

Diagonal matrix: every entry except the diagonal is zero.

$$A = \begin{bmatrix} a_{1} & 0 & \cdots & 0 \\ 0 & a_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{n} \end{bmatrix}$$

trace: sum of the entries on the main diagonal: $\text{tr}(A) = \sum_{i=1}^{n} a_{ii}$

Properties of transpose:

$$\begin{aligned} (A^T)^T &= A \\ (A + B)^T &= A^T + B^T \\ (AB)^T &= B^T A^T \end{aligned}$$

Properties of inverse:

$$\begin{aligned} (A^{-1})^{-1} &= A \\ (AB)^{-1} &= B^{-1} A^{-1} \\ (A^T)^{-1} &= (A^{-1})^T \end{aligned}$$

Inverse of a matrix

if the matrix $A^{-1}$ exists, then $A$ is invertible (non-singular), and vice versa.

quadratic form

Given a square matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is defined as $x^TAx \in \mathbb{R}$:

$$x^TAx = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j$$

norms

A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is a norm if it satisfies the following properties:

  • non-negativity: $\forall x \in \mathbb{R}^n, f(x) \geq 0$
  • definiteness: $f(x) = 0 \iff x=0$
  • homogeneity: $\forall x \in \mathbb{R}^n, t\in \mathbb{R}, f(tx) = |t| f(x)$
  • triangle inequality: $\forall x, y \in \mathbb{R}^n, f(x+y) \leq f(x) + f(y)$

symmetry

A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$ (written $A \in \mathbb{S}^n$)

Anti-symmetric (skew-symmetric) if $A = -A^T$

Given any square matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A + A^T$ is symmetric, and $A - A^T$ is anti-symmetric.

$$A = \frac{1}{2}(A+A^T) + \frac{1}{2}(A-A^T)$$

positive definite

$A$ is positive definite if $x^TAx > 0$ for all non-zero $x \in \mathbb{R}^n$.

  • It is denoted by $A \succ 0$.
  • The set of all positive definite matrices is denoted by $\mathbb{S}^n_{++}$

positive semi-definite

$A$ is positive semi-definite if $x^TAx \geq 0$ for all $x \in \mathbb{R}^n$.

  • It is denoted by $A \succeq 0$.
  • The set of all positive semi-definite matrices is denoted by $\mathbb{S}^n_{+}$

negative definite

$A$ is negative definite if $x^TAx < 0$ for all non-zero $x \in \mathbb{R}^n$.

  • It is denoted by $A \prec 0$.
  • The set of all negative definite matrices is denoted by $\mathbb{S}^n_{--}$

negative semi-definite

$A$ is negative semi-definite if $x^TAx \leq 0$ for all $x \in \mathbb{R}^n$.

  • It is denoted by $A \preceq 0$.
  • The set of all negative semi-definite matrices is denoted by $\mathbb{S}^n_{-}$

A symmetric matrix $A \in \mathbb{S}^n$ is indefinite if it is neither positive semi-definite nor negative semi-definite:

$$\exists x_1, x_2 \in \mathbb{R}^n \space \mid \space x_1^TAx_1 > 0 \text{ and } x_2^TAx_2 < 0$$

Given any matrix $A \in \mathbb{R}^{m \times n}$, the matrix $G = A^TA$ is always positive semi-definite (known as the Gram matrix)

Proof: $x^TGx = x^TA^TAx = (Ax)^T(Ax) = \|Ax\|_2^2 \geq 0$
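A quick numeric sanity check of this claim (the random matrix is an illustrative assumption): every eigenvalue of $A^TA$ is non-negative.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(7, 4))
G = A.T @ A                         # Gram matrix

eigvals = np.linalg.eigvalsh(G)     # eigvalsh: G is symmetric
print(np.all(eigvals >= -1e-12))    # True, up to floating-point tolerance
```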

eigenvalues and eigenvectors

The non-zero vector $x \in \mathbb{C}^n$ is an eigenvector of $A$ and $\lambda \in \mathbb{C}$ is called the eigenvalue of $A$ if:

$$Ax = \lambda x$$

finding eigenvalues

$$\begin{aligned} \exists \text{ non-zero eigenvector } x \in \mathbb{C}^n & \iff \text{null space of } (A - \lambda I) \text{ is non-empty} \\ & \iff (A - \lambda I) \text{ is singular} \\ & \iff \mid A - \lambda I \mid = 0 \end{aligned}$$

Solve for the eigenvectors via $(A-\lambda_{i}I)x_i=0$
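A short numpy sketch (the 2×2 symmetric matrix is an illustrative assumption):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the eigenvectors
print(eigvals)                          # eigenvalues 3 and 1 (order may vary)

# Verify A x = lambda x for each eigenpair.
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))  # True
```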

See also matrix cookbook

matrix representation of a system of linear equations

$$\begin{aligned} x_1 + x_2 + x_3 &= 5 \\ x_1 - 2x_2 - 3x_3 &= -1 \\ 2x_1 + x_2 - x_3 &= 3 \end{aligned}$$

Equivalent matrix representation of $Ax = b$:

$$\begin{aligned} A &= \begin{bmatrix} 1 & 1 & 1 \\ 1 & -2 & -3 \\ 2 & 1 & -1 \end{bmatrix} \\ x &= \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \\ b &= \begin{bmatrix} 5 \\ -1 \\ 3 \end{bmatrix} \end{aligned} \qquad \because A \in \mathbb{R}^{m \times n}, x \in \mathbb{R}^n, b \in \mathbb{R}^m$$

Transpose of a matrix

$A \in \mathbb{R}^{m \times n}$ and $A^T \in \mathbb{R}^{n \times m}$

dot product.

$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$$

linear combination of columns

Let $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$, so that $Ax \in \mathbb{R}^m$.

Then $Ax = \sum_{i=1}^{n}{a_i x_i} \in \mathbb{R}^m$, where $a_i$ is the $i$-th column of $A$

inverse of a matrix

The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is a unique matrix denoted by $A^{-1} \in \mathbb{R}^{n\times{n}}$ such that:

$$A^{-1} A = I = A A^{-1}$$

euclidean norm

$L_{2}$ norm:

$$\| x \|_{2} = \sqrt{\sum_{i=1}^{n}{x_i^2}} = \sqrt{x^Tx}$$

L1 norm: $\| x \|_{1} = \sum_{i=1}^{n}{|x_i|}$

$L_{\infty}$ norm: $\| x \|_{\infty} = \max_{i}{|x_i|}$

p-norm: $\| x \|_{p} = (\sum_{i=1}^{n}{|x_i|^p})^{1/p}$

Comparison

$$\|x\|_{\infty} \leq \|x\|_{2} \leq \|x\|_{1}$$

One can prove this with Cauchy-Schwarz inequality

linear dependence of vectors

Given $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$, the vectors are linearly independent if none of them can be written as a linear combination of the others:

$$\forall i \in [ n ], \forall \{a_j\}_{j \neq i} \subseteq \mathbb{R}, \space x_i \neq \sum_{j \neq i}{a_j x_j}$$

Span

Given a set of vectors $\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{R}^d$, the span of the set is the set of all possible linear combinations of the vectors.

$$\text{span}(\{x_1, x_2, \ldots, x_n\}) = \{ y: y = \sum_{i=1}^{n}{\alpha_i x_i} \mid \alpha_i \in \mathbb{R} \}$$

If $x_{1}, x_{2}, \ldots, x_{n}$ are linearly independent and $n = d$, then the span of the set is the entire space $\mathbb{R}^d$

Rank

For a matrix $A \in \mathbb{R}^{m \times n}$:

  • column rank: max number of linearly independent columns of $A$
  • row rank: max number of linearly independent rows of $A$

If $\text{rank}(A) = m$, then the rows are linearly independent. If $\text{rank}(A) = n$, then the columns are linearly independent.

rank of a matrix $A$ is the number of linearly independent columns of $A$:

  • if $A$ is full rank, then $\text{rank}(A) = \min(m, n)$ (in general $\text{rank}(A) \leq \min(m, n)$)
  • $\text{rank}(A) = \text{rank}(A^T)$

solving linear system of equations

If $A \in \mathbb{R}^{n \times n}$ is invertible, there exists a unique solution:

$$x = A^{-1}b$$
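A quick sketch using the 3×3 system from above; numpy solves it without forming $A^{-1}$ explicitly:

```python
import numpy as np

A = np.array([[1.0,  1.0,  1.0],
              [1.0, -2.0, -3.0],
              [2.0,  1.0, -1.0]])
b = np.array([5.0, -1.0, 3.0])

x = np.linalg.solve(A, b)     # preferred over np.linalg.inv(A) @ b
print(x)
print(np.allclose(A @ x, b))  # True
```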

Range and Projection

Given a matrix $A \in \mathbb{R}^{m \times n}$, the range of $A$, denoted by $\mathcal{R}(A)$, is the span of the columns of $A$:

$$\mathcal{R}(A) = \{ y \in \mathbb{R}^m \mid y = Ax,\ x \in \mathbb{R}^n \}$$

Projection of a vector $y \in \mathbb{R}^m$ onto $\text{span}(\{x_1, \cdots, x_n\})$, $x_i \in \mathbb{R}^m$, is the vector in the span that is as close as possible to $y$ w.r.t. the $\ell_2$ norm

$$\text{Proj}(y; \{x_{1}, \cdots, x_n\}) = \argmin_{{v \in \text{span}(\{x_1, \cdots, x_n\})}} \| y - v \|_2$$

Null space of $A$

is the set of all vectors that satisfy the following:

$$\mathcal{N}(A) = \{ x \in \mathbb{R}^n \mid Ax = 0 \}$$

probability theory

With Bayes rules we have

$$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$

The chain rule states that for events $A_1, \ldots, A_n$:

$$\begin{aligned} P(A_1 \cap A_2 \cap \ldots \cap A_n) &= P(A_n|A_{n-1} \cap \ldots \cap A_1)P(A_{n-1} \cap \ldots \cap A_1) \\ &= P(A_1) \prod_{i=2}^{n} P(A_i|\cap_{j=1}^{i-1} A_j) \end{aligned}$$

Law of Total Probability

If $B_{1}, \ldots , B_{n}$ form a finite partition of the sample space, i.e. $\forall i \neq j,\ B_i \cap B_j = \emptyset$ and $\cup_{i=1}^{n} B_i = \Omega$, then the law of total probability states that for an event $A$:

$$P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)$$

cumulative distribution function

For a random variable $X$, a CDF $F_X(x): \mathbb{R} \rightarrow [0,1]$ is defined as:

$$F_X(x) \coloneqq P(X \leq x)$$

  • $0 \leq F_X(x) \leq 1$
  • $P(a < X \leq b) = F_X(b) - F_X(a)$

probability mass function

for a discrete random variable $X$, the probability mass function $p_X(x) : \mathbb{R} \rightarrow [0,1]$ is defined as:

$$p_X(x) \coloneqq P(X=x)$$

  • $0 \leq p_X(x) \leq 1$
  • $\sum_{x \in \mathbb{D}} p_X(x) = 1$, where $\mathbb{D}$ is the set of all possible values of $X$
  • $P(X \in A) = P(\{\omega: X(\omega) \in A\}) = \sum_{x \in A} p_X(x)$

probability density function

for a continuous random variable $X$, the probability density function $f_X(x) : \mathbb{R} \rightarrow [0, \infty)$ is defined as:

$$f_X(x) \coloneqq \frac{d F_X(x)}{dx}$$

  • $f_X(x) \geq 0$
  • $F_X(x) = \int_{-\infty}^{x}f_X(t)\,dt$

Expectation

for a discrete random variable with PMF $p_X(x)$ and $g(x): \mathbb{R} \rightarrow \mathbb{R}$, the expectation of $g(x)$ is:

$$\mathbb{E}[g(X)] = \sum_{x \in \mathbb{D}} g(x) p_X(x)$$

for a continuous random variable with PDF $f_X(x)$, the expectation of $g(x)$ is:

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$$

Therefore, the mean of a random variable $X$ is $\mathbb{E}[X]$:

$$\mu = \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx$$

Variance of a random variable $X$ is:

$$\sigma^2 = \mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

  • $\text{Var}(f(X)+c)=\text{Var}(f(X))$
  • $\text{Var}(cf(X)) = c^2 \text{Var}(f(X))$

discrete random variables

Bernoulli distribution: $X \sim \text{Bernoulli}(p), 0 \le p \le 1$

$$\begin{aligned} p_X(x) &= \begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases} \\ \\ \mathbb{E}[X] &= p \\ \text{Var}(X) &= p(1-p) \end{aligned}$$

Binomial distribution: $X \sim \text{Binomial}(n,p), 0 \le p \le 1$

$$\begin{aligned} p_X(x) &= \binom{n}{x} p^x (1-p)^{n-x} \\ \because \binom{n}{x} &= \frac{n!}{x!(n-x)!} \\ \mathbb{E}[X] &= np \\ \text{Var}(X) &= np(1-p) \end{aligned}$$

Poisson distribution: $X \sim \text{Poisson}(\lambda), \lambda > 0$

$$\begin{aligned} p_X(x) &= \frac{e^{-\lambda} \lambda^x}{x!} \\ \mathbb{E}[X] &= \lambda \\ \text{Var}(X) &= \lambda \end{aligned}$$

continuous random variables

Uniform distribution: $X \sim \text{Unif}(a,b), a \le b$

$$\begin{aligned} f_X(x) &= \begin{cases} \frac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \\ \\ \mathbb{E}[X] &= \frac{a+b}{2} \\ \text{Var}(X) &= \frac{(b-a)^2}{12} \end{aligned}$$

Exponential distribution: $X \sim \text{Exp}(\lambda), \lambda > 0$

$$\begin{aligned} f_X(x) &= \lambda e^{-\lambda x}, \quad x \ge 0 \\ \mathbb{E}[X] &= \frac{1}{\lambda} \\ \text{Var}(X) &= \frac{1}{\lambda^2} \end{aligned}$$

Gaussian distribution: $X \sim \mathcal{N}(\mu, \sigma^2), -\infty < \mu < \infty, \sigma^2 > 0$

$$\begin{aligned} f_X(x) &= \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\ \\ \mathbb{E}[X] &= \mu \\ \text{Var}(X) &= \sigma^2 \end{aligned}$$