
See also some statistical theory

bayes rule and chain rule

Joint distribution: $P(X,Y)$

Conditional distribution of $X$ given $Y$: $P(X \mid Y) = \frac{P(X,Y)}{P(Y)}$

Bayes rule: $P(X \mid Y) = \frac{P(Y \mid X)P(X)}{P(Y)}$

Chain rule:

for two events:

$P(A, B) = P(B \mid A)P(A)$

generalised:

$$
\begin{aligned}
P(X_1, X_2, \ldots, X_k) &= P(X_1) \prod_{j=2}^{k} P(X_j \mid X_1, \ldots, X_{j-1}) \\
&= P(X_1)\,P(X_2 \mid X_1) \cdots P(X_k \mid X_1, X_2, \ldots, X_{k-1})
\end{aligned}
$$
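A quick numerical sanity check of these identities on a toy joint distribution (a minimal sketch; the 2×2 table below is made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over X in {0,1}, Y in {0,1}
# (values are made up; rows index X, columns index Y)
P_XY = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_X = P_XY.sum(axis=1)              # marginal P(X)
P_Y = P_XY.sum(axis=0)              # marginal P(Y)

# Conditional P(X | Y) = P(X, Y) / P(Y)
P_X_given_Y = P_XY / P_Y            # divide each column by P(Y)

# Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)
P_Y_given_X = P_XY / P_X[:, None]
bayes = P_Y_given_X * P_X[:, None] / P_Y
assert np.allclose(P_X_given_Y, bayes)

# Chain rule for two events: P(X, Y) = P(Y | X) P(X)
assert np.allclose(P_XY, P_Y_given_X * P_X[:, None])
```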

i.i.d assumption

assume an underlying distribution $D$, and that the train and test sets are independent and identically distributed (i.i.d) samples from it

Example: flip a coin

Outcome $H=0$ or $T=1$ with $P(H) = p$ and $P(T) = 1-p$; i.e. $x \in \{0,1\}$, where $x$ is a Bernoulli random variable.

$P(x=0)=\alpha$ and $P(x=1)=1-\alpha$
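A minimal sketch of the coin example, assuming numpy: draw i.i.d Bernoulli samples and check that the empirical frequency approaches $1-\alpha$ (the value of $\alpha$ below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3                      # P(x = 0) = alpha, P(x = 1) = 1 - alpha

# n i.i.d draws from the same underlying Bernoulli distribution D
n = 100_000
x = rng.binomial(n=1, p=1 - alpha, size=n)

print(x.mean())                  # ~ 1 - alpha = 0.7
```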


Note that for any random variables $A,B,C$ we have:

$P(A,B \mid C) = P(A\mid B,C)\,P(B \mid C)$

nearest neighbour

See also: slides 13, slides 14, slides 15

expected error minimisation

think of it as bias-variance tradeoff

Squared loss: $\ell(\hat{y},y)=(y-\hat{y})^2$

The solution to $\hat{y}^{*} = \argmin_{\hat{y}} E_{X,Y}(Y-\hat{y}(X))^2$ is $\hat{y}^{*}(x) = E[Y \mid X=x]$
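A short justification (standard argument, not from the slides): for a fixed $x$ and any candidate value $c$,

$$
E\left[(Y-c)^2 \mid X = x\right] = E\left[(Y - E[Y \mid X=x])^2 \mid X=x\right] + \left(E[Y \mid X=x] - c\right)^2
$$

since the cross term has zero conditional expectation. The second term is minimised at $c = E[Y \mid X=x]$, so the conditional mean is the optimal predictor under squared loss.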

In practice we do not know the distribution; instead we only have a finite training set $Z = \{(x^i, y^i)\}^n_{i=1}$

error decomposition

$$
\begin{aligned}
E_{x,y}(y-\hat{y}_Z(x))^2 &= E_{x,y}(y-y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\
&= \text{noise} + \text{estimation error}
\end{aligned}
$$

bias-variance decompositions

For a linear estimator $\hat{y}_Z(x) \coloneqq W^T_Z x$:

$$
\begin{aligned}
E_Z E_{x,y}(y-\hat{y}_Z(x))^2 =\; & E_{x,y}(y-y^{*}(x))^2 && \text{noise} \\
& + E_x\big(y^{*}(x) - E_Z(\hat{y}_Z(x))\big)^2 && \text{bias} \\
& + E_x E_Z\big(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x))\big)^2 && \text{variance}
\end{aligned}
$$
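A hedged numpy sketch of this decomposition: the data-generating process, sample sizes, and least-squares estimator below are my own choices for illustration. Fit a linear estimator on many training sets $Z$ drawn from the same distribution, then estimate the noise, bias, and variance terms empirically.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(n):
    """One training set Z = {(x_i, y_i)}: y = 2x + Gaussian noise (assumed model)."""
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(0, 0.5, size=n)
    return x, y

def fit_w(x, y):
    """Least-squares fit of y ~ w * x (linear estimator, no bias term)."""
    return (x @ y) / (x @ x)

# Many training sets -> empirical distribution of the estimator w_Z
ws = np.array([fit_w(*sample_z(n=20)) for _ in range(500)])

x_test = rng.uniform(-1, 1, size=5_000)
y_star = 2 * x_test                          # y*(x) = E[Y | X = x]
preds = ws[:, None] * x_test[None, :]        # \hat{y}_Z(x) for each Z

noise    = 0.5 ** 2                          # variance of the additive noise
bias     = np.mean((y_star - preds.mean(axis=0)) ** 2)
variance = np.mean((preds - preds.mean(axis=0)) ** 2)

print(noise, bias, variance)   # expected risk ~ noise + bias + variance
```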

accuracy

zero-one loss:

$$
\ell^{0-1}(y, \hat{y}) = 1_{y \neq \hat{y}} = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}
$$

linear classifier

$$
\begin{aligned}
\hat{y}_W(x) &= \text{sign}(W^T x) = 1_{W^T x \geq 0} \\
\hat{W} &= \argmin_{W} L_{Z}^{0-1} (\hat{y}_W)
\end{aligned}
$$
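A minimal numpy sketch of the zero-one loss and the sign classifier (the toy data and the $\pm 1$ label encoding are my own choices):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Empirical 0-1 loss: fraction of examples where y != y_hat."""
    return np.mean(y != y_hat)

def linear_classifier(W, X):
    """y_hat_W(x) = sign(W^T x), encoded as +/-1."""
    return np.where(X @ W >= 0, 1, -1)

# toy data (made up): 2-d points with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

W = np.array([1.0, 0.5])
print(zero_one_loss(y, linear_classifier(W, X)))   # 0.0 for this W
```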

surrogate loss functions

assume the classifier returns a discrete value $\hat{y}_W = \text{sign}(W^T x) \in \{0,1\}$

What if classifier's output is continuous?

$\hat{y}$ will also capture the “confidence” of the classifier.

Think of continuous surrogate loss functions: margin loss, cross-entropy/negative log-likelihood, etc.

linearly separable data

linearly separable

A binary classification data set $Z=\{(x^i, y^i)\}_{i=1}^{n}$ is linearly separable if there exists a $W^{*}$ such that:

  • $\forall i \in [n],\ \text{sign}(\langle x^i, W^{*} \rangle) = y^i$
  • Or equivalently, for every $i \in [n]$ we have $({W^{*}}^T x^i)\,y^i > 0$

linear programming

$$
\begin{aligned}
\max_{w \in \mathbb{R}^d} \quad & \langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \\
\text{s.t. } \quad & A w \ge v
\end{aligned}
$$

Given that the data is linearly separable, the margin $\gamma = \min_{i} ({W^{*}}^T x^i)\,y^i$ over the finite training set is strictly positive, and rescaling $W^{*}$ by $\frac{1}{\gamma}$ gives a separator with margin at least 1:

$$
\begin{aligned}
\exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)\,y^i > 0 \\
\exists\, W^{*},\ \gamma > 0 &\mid \forall i \in [n],\ ({W^{*}}^T x^i)\,y^i \ge \gamma \\
\exists\, W^{*} &\mid \forall i \in [n],\ ({W^{*}}^T x^i)\,y^i \ge 1
\end{aligned}
$$

LP for linear classification

  • Define $A = [x_j^i\, y^i]_{n \times d}$

  • finding an optimal $W$ is then equivalent to solving the LP below (see the code sketch after it)

    $$\begin{aligned} \max_{w \in \mathbb{R}^d} \quad & \langle \vec{0}, w \rangle \\ \text{s.t. } \quad & Aw \ge \vec{1} \end{aligned}$$
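A hedged sketch of this LP using scipy.optimize.linprog (the toy data is my own; linprog minimises and uses $\le$ constraints, so the zero objective and $Aw \ge \vec{1}$ are passed as $-Aw \le -\vec{1}$):

```python
import numpy as np
from scipy.optimize import linprog

# toy linearly separable data (made up): rows are x^i, labels y^i in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

A = X * y[:, None]          # A[i, j] = x_j^i * y^i

# max <0, w>  s.t.  A w >= 1   (linprog expects: min c^T w  s.t.  A_ub w <= b_ub)
res = linprog(c=np.zeros(X.shape[1]),
              A_ub=-A,
              b_ub=-np.ones(len(y)),
              bounds=[(None, None)] * X.shape[1])   # allow negative weights

w = res.x                   # any feasible w separates the data
print(w, np.all((X @ w) * y > 0))
```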

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3: if $\exists\, i$ s.t. $y_i\langle\mathbf{w}^{(t)}, \mathbf{x}_i\rangle \leq 0$ then
4: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i\mathbf{x}_i$
5: else
6: output $\mathbf{w}^{(t)}$
7: end if
8: end for
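A direct numpy translation of the batch perceptron above (a sketch; the toy data is my own and assumed linearly separable so the loop terminates):

```python
import numpy as np

def batch_perceptron(X, y):
    """Rosenblatt's perceptron: update on a misclassified example until none remain."""
    w = np.zeros(X.shape[1])                      # w^(1) = (0, ..., 0)
    while True:
        margins = y * (X @ w)
        mistakes = np.where(margins <= 0)[0]      # i s.t. y_i <w, x_i> <= 0
        if len(mistakes) == 0:
            return w                              # output w^(t)
        i = mistakes[0]
        w = w + y[i] * X[i]                       # w^(t+1) = w^(t) + y_i x_i

# toy linearly separable data (made up)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = batch_perceptron(X, y)
print(w, np.all(y * (X @ w) > 0))   # returned w separates the data
```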

greedy update

$$
\begin{aligned}
W_{\text{new}}^T x^i y^i &= \langle W_{\text{old}} + y^i x^i,\ x^i \rangle\, y^i \\
&= W_{\text{old}}^T x^{i} y^{i} + \|x^i\|_2^2\, (y^{i})^2
\end{aligned}
$$

so each update increases the margin $W^T x^i y^i$ on the misclassified example by $\|x^i\|_2^2 > 0$.

proof

See also (Novikoff, 1962)

Theorem

Assume there exists some parameter vector $\underline{\theta}^{*}$ such that $\|\underline{\theta}^{*}\| = 1$, and some $\gamma > 0$ such that for all $t$

$$
y_t(\underline{x_t} \cdot \underline{\theta^{*}}) \ge \gamma
$$

Assumption: $\forall\ t = 1 \ldots n,\ \|\underline{x_t}\| \le R$

Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.

proof by induction

definition of $\underline{\theta^k}$

to be the parameter vector when the algorithm makes its $k^{\text{th}}$ error.

Note that we have $\underline{\theta^{1}}=\underline{0}$

Assume that the $k^{\text{th}}$ error is made on example $t$. Then:

$$
\begin{aligned}
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} &= (\underline{\theta^k} + y_t \underline{x_t}) \cdot \underline{\theta^{*}} \\
&= \underline{\theta^k} \cdot \underline{\theta^{*}} + y_t \underline{x_t} \cdot \underline{\theta^{*}} \\
&\ge \underline{\theta^k} \cdot \underline{\theta^{*}} + \gamma
\end{aligned}
$$

where the last step uses the assumption $y_t \underline{x_t} \cdot \underline{\theta^{*}} \ge \gamma$.

It follows by induction on $k$ (with base case $\underline{\theta^{1}} \cdot \underline{\theta^{*}} = 0$) that

$$
\underline{\theta^{k+1}} \cdot \underline{\theta^{*}} \ge k \gamma
$$

Using Cauchy-Schwarz we have $\|\underline{\theta^{k+1}}\| \times \|\underline{\theta^{*}}\| \ge \underline{\theta^{k+1}} \cdot \underline{\theta^{*}}$, and since $\|\underline{\theta^{*}}\| = 1$:

$$
\|\underline{\theta^{k+1}}\| \ge k \gamma \tag{1}
$$

In the second part, we find an upper bound for the norm in (1):

$$
\begin{aligned}
\|\underline{\theta^{k+1}}\|^2 &= \|\underline{\theta^k} + y_t \underline{x_t}\|^2 \\
&= \|\underline{\theta^k}\|^2 + y_t^2 \|\underline{x_t}\|^2 + 2 y_t \underline{x_t} \cdot \underline{\theta^k} \\
&\le \|\underline{\theta^k}\|^2 + R^2
\end{aligned}
\tag{2}
$$

(2) is due to:

  • $y_t^2 \|\underline{x_t}\|^2 = \|\underline{x_t}\|^2 \le R^2$, since $y_t^2 = 1$ and by the assumption of the theorem
  • $y_t \underline{x_t} \cdot \underline{\theta^k} \le 0$, since the parameter vector $\underline{\theta^k}$ made an error on the $t^{\text{th}}$ example

It follows by induction on $k$ that

$$
\|\underline{\theta^{k+1}}\|^2 \le kR^2 \tag{3}
$$

Combining (1) and (3) gives us

$$
\begin{aligned}
k^2 \gamma^2 &\le \|\underline{\theta^{k+1}}\|^2 \le kR^2 \\
k &\le \frac{R^2}{\gamma^2}
\end{aligned}
$$

Support Vector Machine

idea: maximise the margin, which makes the classifier more robust to “perturbations” of the data

The Euclidean distance between a point $x$ and the hyperplane parametrised by $(W, b)$ is:

$$
\frac{\lvert W^T x + b \rvert}{\|W\|_2}
$$

Assuming $\|W\|_2 = 1$, the distance is $\lvert W^T x + b \rvert$
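A small numpy sketch of this distance computation (the hyperplane and point below are arbitrary examples):

```python
import numpy as np

W = np.array([3.0, 4.0])        # arbitrary hyperplane normal (made up)
b = 1.0
x = np.array([1.0, 2.0])

# |W^T x + b| / ||W||_2
dist = np.abs(W @ x + b) / np.linalg.norm(W)
print(dist)                      # 12 / 5 = 2.4

# rescaling (W, b) by 1 / ||W||_2 keeps the same hyperplane with ||W||_2 = 1,
# so the distance reduces to |W^T x + b|
W1, b1 = W / np.linalg.norm(W), b / np.linalg.norm(W)
print(np.abs(W1 @ x + b1))       # same value
```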

regularization

SVMs behave well on high-dimensional data: maximising the margin acts as a form of regularisation

The resulting optimisation problem can be solved with a generic solver (LP/QP) or with gradient descent

maximum margin hyperplane

$W$ has margin $\gamma$ if

$$
\begin{aligned}
W^T x + b \ge \gamma \quad &\forall \text{ blue } x \\
W^T x + b \le -\gamma \quad &\forall \text{ red } x
\end{aligned}
$$

Margin:

$Z = \{(x^{i}, y^{i})\}_{i=1}^{n},\ y \in \{-1, 1\},\ \|W\|_2 = 1$; the margin of $W$ on $Z$ is $\min_{i \in [n]} y^{i}(W^T x^{i} + b)$

hard-margin SVM

this is the version with bias

"\\begin{algorithm}\n\\caption{Hard-SVM}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{solve:} $(w_{0},b_{0}) = \\argmin\\limits_{(w,b)} \\|w\\|^2 \\text{ s.t } \\forall i, y_{i}(\\langle{w,x_i} \\rangle + b) \\ge 1$\n\\STATE \\textbf{output:} $\\hat{w} = \\frac{w_0}{\\|w_0\\|}, \\hat{b} = \\frac{b_0}{\\|w_0\\|}$\n\\end{algorithmic}\n\\end{algorithm}"

Hard-SVM

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: solve: $(w_{0},b_{0}) = \argmin\limits_{(w,b)} \|w\|^2 \text{ s.t. } \forall i,\ y_{i}(\langle{w,x_i} \rangle + b) \ge 1$
2: output: $\hat{w} = \frac{w_0}{\|w_0\|},\ \hat{b} = \frac{b_0}{\|w_0\|}$

Note that this version is sensitive to outliers

it assumes that the training set is linearly separable
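A hedged sketch of the Hard-SVM program above, assuming cvxpy is available (the toy data is my own and must be linearly separable for the problem to be feasible):

```python
import cvxpy as cp
import numpy as np

# toy linearly separable data (made up)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# solve: argmin ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()

# output step: rescale by ||w_0||
w0, b0 = w.value, b.value
w_hat, b_hat = w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
print(w_hat, b_hat)
```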

soft-margin SVM

can be applied even if the training set is not linearly separable

"\\begin{algorithm}\n\\caption{Soft-SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{parameter:} $\\lambda > 0$\n\\STATE \\textbf{solve:} $\\min_{\\mathbf{w}, b, \\boldsymbol{\\xi}} \\left( \\lambda \\|\\mathbf{w}\\|^2 + \\frac{1}{m} \\sum_{i=1}^m \\xi_i \\right)$\n\\STATE \\textbf{s.t: } $\\forall i, \\quad y_i (\\langle \\mathbf{w}, \\mathbf{x}_i \\rangle + b) \\geq 1 - \\xi_i \\quad \\text{and} \\quad \\xi_i \\geq 0$\n\\STATE \\textbf{output:} $\\mathbf{w}, b$\n\\end{algorithmic}\n\\end{algorithm}"