See also: slides 13, slides 14, slides 15

$$\hat{y}_W(x) = \text{sign}(W^T x) = 1_{W^T x \geq 0}$$

$$\hat{W} = \argmin_{W} L_{Z}^{0-1} (\hat{y}_W)$$

The 0-1 loss is hard to optimize directly, so think of continuous surrogate loss functions: margin (hinge) loss, cross-entropy/negative log-likelihood, etc.
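A minimal numpy sketch (toy data and weights made up for illustration) comparing the 0-1 loss against two such continuous surrogates:

```python
import numpy as np

# Toy data and weights, made up for illustration; y in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, -1, -1])
W = np.array([0.5, 1.0])

scores = X @ W                                    # W^T x^i for each example
margins = y * scores                              # y^i * W^T x^i

y_hat = np.where(scores >= 0, 1, -1)              # sign(W^T x), with 1_{W^T x >= 0}
zero_one = np.mean(y_hat != y)                    # 0-1 loss (what we actually want)
hinge = np.mean(np.maximum(0.0, 1.0 - margins))   # margin (hinge) loss
log_loss = np.mean(np.log1p(np.exp(-margins)))    # cross-entropy / negative log-likelihood

print(zero_one, hinge, log_loss)
```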

linear programming

$$\max_{w \in \mathbb{R}^d} \langle u, w \rangle = \sum_{i=1}^{d} u_i w_i \quad \text{s.t. } A w \ge v$$

Given that the data is linearly separable:

$$\exists W^{*} \mid \forall i \in [n],\ ({W^{*}}^T x^i) y^i > 0$$

So, taking $\gamma = \min_{i \in [n]} ({W^{*}}^T x^i) y^i$ (positive, since it is a minimum over finitely many positive terms):

$$\exists W^{*}, \gamma > 0 \mid \forall i \in [n],\ ({W^{*}}^T x^i) y^i \ge \gamma$$

So, rescaling $W^{*}$ to $W^{*}/\gamma$:

$$\exists W^{*} \mid \forall i \in [n],\ ({W^{*}}^T x^i) y^i \ge 1$$
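Finding such a $W^{*}$ is exactly an LP feasibility problem of the form above: take the rows of $A$ to be $y^i {x^i}^T$ and $v = \mathbf{1}$. A minimal sketch with `scipy.optimize.linprog` on made-up separable data, using a zero objective since any feasible point will do:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Find any w with y^i <w, x^i> >= 1 for all i.  linprog minimizes
# c @ w subject to A_ub @ w <= b_ub, so flip the signs of A and v.
A_ub = -(y[:, None] * X)                 # row i is -y^i x^i
b_ub = -np.ones(len(y))                  # -v
c = np.zeros(X.shape[1])                 # zero objective: feasibility only

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
print("separator found:", res.success, "w =", res.x)
```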

perceptron

Rosenblatt’s perceptron algorithm

"\\begin{algorithm}\n\\caption{Batch Perceptron}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE Initialize $\\mathbf{w}^{(1)} = (0,\\ldots,0)$\n\\FOR{$t = 1,2,\\ldots$}\n \\IF{$(\\exists \\space i \\text{ s.t. } y_i\\langle\\mathbf{w}^{(t)}, \\mathbf{x}_i\\rangle \\leq 0)$}\n \\STATE $\\mathbf{w}^{(t+1)} = \\mathbf{w}^{(t)} + y_i\\mathbf{x}_i$\n \\ELSE\n \\STATE \\textbf{output} $\\mathbf{w}^{(t)}$\n \\STATE \\textbf{break}\n \\ENDIF\n\\ENDFOR\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 6 Batch Perceptron

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: Initialize $\mathbf{w}^{(1)} = (0,\ldots,0)$
2: for $t = 1,2,\ldots$ do
3:   if $\exists\, i$ s.t. $y_i \langle \mathbf{w}^{(t)}, \mathbf{x}_i \rangle \leq 0$ then
4:     $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_i \mathbf{x}_i$
5:   else
6:     output $\mathbf{w}^{(t)}$
7:     break
8:   end if
9: end for
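A minimal numpy translation of the pseudocode (the `max_iters` cap is my own safeguard: the loop as written only terminates when the data is linearly separable):

```python
import numpy as np

def batch_perceptron(X, y, max_iters=10_000):
    """Algorithm 6: X has one example per row, y in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # w^(1) = (0, ..., 0)
    for _ in range(max_iters):
        violations = np.flatnonzero(y * (X @ w) <= 0)
        if violations.size == 0:             # no i with y_i <w^(t), x_i> <= 0
            return w                         # output w^(t)
        i = violations[0]                    # any violated example works
        w = w + y[i] * X[i]                  # w^(t+1) = w^(t) + y_i x_i
    raise RuntimeError("no separator found; data may not be separable")
```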

greedy update

$$W_{\text{new}}^T x^i y^i = \langle W_{\text{old}} + y^i x^i,\ x^i \rangle y^i = W_{\text{old}}^T x^i y^i + (y^i)^2 \|x^i\|_2^2 = W_{\text{old}}^T x^i y^i + \|x^i\|_2^2$$

so each update strictly increases the score of the violated example (using $(y^i)^2 = 1$).
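A quick numeric check of this identity on arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
w_old = rng.normal(size=3)
x_i = rng.normal(size=3)
y_i = -1.0                                # any label in {-1, +1}

w_new = w_old + y_i * x_i                 # perceptron update
lhs = (w_new @ x_i) * y_i                 # new score of example i
rhs = (w_old @ x_i) * y_i + x_i @ x_i     # old score + ||x^i||_2^2
print(np.isclose(lhs, rhs))               # True
```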

SVM

idea: maximize the margin; a larger margin makes the classifier more robust to “perturbations”

The Euclidean distance between a point $x$ and the hyperplane parametrized by $W$ and $b$ is:

$$\frac{\mid W^T x + b \mid}{\|W\|_2}$$

Assuming $\|W\|_2 = 1$, the distance is simply $\mid W^T x + b \mid$.
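A tiny sketch of this computation (the example point and hyperplane are made up):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Euclidean distance from x to the hyperplane {z : <w, z> + b = 0};
    when ||w||_2 = 1 this reduces to |<w, x> + b|."""
    return abs(w @ x + b) / np.linalg.norm(w)

print(distance_to_hyperplane(np.array([1.0, 1.0]), np.array([0.6, 0.8]), -0.5))  # 0.9
```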

maximum margin hyperplane

$W$ has a $\gamma$ margin if

  • $W^T x + b \ge \gamma$ for all blue $x$
  • $W^T x + b \le -\gamma$ for all red $x$

Margin:

$$Z = \{(x^{i}, y^{i})\}_{i=1}^{n}, \quad y \in \{-1, 1\}, \quad \|W\|_2 = 1$$
"\\begin{algorithm}\n\\caption{Hard-SVM}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{solve:} $(w_{0},b_{0}) = \\argmin\\limits_{(w,b)} \\|w\\|^2 \\text{ s.t } \\forall i, y_{i}(\\langle{w,x_i} \\rangle + b) \\ge 1$\n\\STATE \\textbf{output:} $\\hat{w} = \\frac{w_0}{\\|w_0\\|}, \\hat{b} = \\frac{b_0}{\\|w_0\\|}$\n\\end{algorithmic}\n\\end{algorithm}"

Algorithm 7 Hard-SVM

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1: solve: $(w_{0}, b_{0}) = \argmin\limits_{(w,b)} \|w\|^2 \text{ s.t. } \forall i,\ y_{i}(\langle w, x_i \rangle + b) \ge 1$
2: output: $\hat{w} = \frac{w_0}{\|w_0\|}, \quad \hat{b} = \frac{b_0}{\|w_0\|}$
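A sketch of this QP using cvxpy (my choice of solver library; the notes don't prescribe one), assuming the data is linearly separable, otherwise the problem is infeasible:

```python
import numpy as np
import cvxpy as cp  # assumed QP solver library, not from the notes

def hard_svm(X, y):
    """Algorithm 7: minimize ||w||^2 s.t. y_i(<w, x_i> + b) >= 1 for all i."""
    m, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    w0, b0 = w.value, b.value
    return w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)  # ||w_hat||_2 = 1
```

With $\|\hat{w}\|_2 = 1$, the achieved margin on the training set is $\min_i y_i(\langle \hat{w}, x_i \rangle + \hat{b})$.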