maximum likelihood estimation

$$
\begin{aligned}
\alpha^{\text{MLE}} &= \argmax_{\alpha} P(X \mid \alpha) \\
&= \argmin_{\alpha} \left( -\sum_{i} \log P(x^i \mid \alpha) \right)
\end{aligned}
$$
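As a sanity check, the negative log-likelihood can be minimised numerically. A minimal sketch for a Gaussian with known variance (the data and names are illustrative, not from the notes): the grid minimiser of the NLL recovers the closed-form MLE, the sample mean.

```python
import numpy as np

# Sketch: MLE of the mean alpha of a Gaussian with known variance.
# Minimising -sum_i log P(x^i | alpha) over a grid should recover
# the sample mean, which is the closed-form MLE.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

def neg_log_likelihood(alpha, x, sigma=1.0):
    # -sum_i log N(x^i; alpha, sigma^2)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - alpha) ** 2 / (2 * sigma**2))

grid = np.linspace(0.0, 4.0, 4001)
alpha_mle = grid[np.argmin([neg_log_likelihood(a, x) for a in grid])]
print(alpha_mle, x.mean())  # the two agree up to the grid resolution
```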

$P(\alpha)$ captures the a priori distribution of $\alpha$.

$P(\alpha \mid X)$ is the posterior distribution of $\alpha$ given $X$.

maximum a posteriori estimation

$$
\begin{aligned}
\alpha^{\text{MAP}} &= \argmax_{\alpha} P(\alpha \mid X) \\
&= \argmax_{\alpha} \frac{P(X \mid \alpha) P(\alpha)}{P(X)} \\
&= \argmin_{\alpha} \left( -\log P(\alpha) - \sum_{i=1}^{n} \log P(x^i \mid \alpha) \right)
\end{aligned}
$$

For linear regression with a Gaussian prior on the weights,

$$
P(W) = \frac{1}{\beta} e^{-\lambda \parallel W \parallel_{2}^{2}},
$$

the MAP objective becomes

$$
\begin{aligned}
\argmax_{W} P(Z \mid W) P(W) &= \argmax_{W} \left[ \log P(W) + \sum_{i} \log P(x^i, y^i \mid W) \right] \\
&= \argmax_{W} \left[ -\lambda \parallel W \parallel_{2}^{2} - \sum_{i} \frac{({x^i}^T W - y^i)^2}{2 \sigma^2} \right] \quad \text{(dropping terms constant in } W)
\end{aligned}
$$

What if instead the prior has a scale $r$,

$$
P(W) = \frac{1}{\beta} e^{-\frac{\lambda \parallel W \parallel_{2}^{2}}{r^2}}?
$$

Then the same derivation goes through with $\lambda$ replaced by $\lambda / r^2$: a wider prior (larger $r$) means weaker regularisation.
Without the prior, maximum likelihood alone gives

$$
\argmax_{W} P(Z \mid W) = \argmax_{W} \sum_{i} \log P(x^i, y^i \mid W),
$$

with the Gaussian likelihood

$$
P(y \mid x, W) = \frac{1}{\gamma} e^{-\frac{(x^T W - y)^2}{2 \sigma^2}}.
$$
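The MAP objective with a Gaussian prior is ridge regression in disguise. A minimal sketch (data and names are illustrative): setting the gradient of $\lambda \parallel W \parallel_2^2 + \sum_i ({x^i}^T W - y^i)^2 / (2\sigma^2)$ to zero gives the closed form $W = (X^T X + 2\lambda\sigma^2 I)^{-1} X^T y$, i.e. ridge regression with coefficient $2\lambda\sigma^2$.

```python
import numpy as np

# Sketch: MAP estimate under a Gaussian prior on W == ridge regression.
# Minimising lambda*||W||^2 + sum_i (x_i^T W - y_i)^2 / (2 sigma^2)
# has the closed form W = (X^T X + 2*lambda*sigma^2 I)^{-1} X^T y.
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])   # illustrative ground truth
sigma = 0.5
y = X @ w_true + rng.normal(scale=sigma, size=n)

lam = 1.0
ridge = 2 * lam * sigma**2            # effective ridge coefficient
w_map = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def neg_log_post(w):
    # negative log posterior, up to constants in w
    return lam * w @ w + np.sum((X @ w - y) ** 2) / (2 * sigma**2)

# w_map minimises the objective: any perturbation increases it
print(w_map, neg_log_post(w_map) <= neg_log_post(w_map + 0.01))
```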

expected error minimisation

Squared loss: $\ell(\hat{y}, y) = (y - \hat{y})^2$

The solution to $\hat{y}^* = \argmin_{\hat{y}} E_{X,Y}(Y - \hat{y}(X))^2$ is the conditional expectation $\hat{y}^*(x) = E[Y \mid X = x]$.
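This can be checked empirically. A minimal sketch with a binary $X$ (the distribution is illustrative): predicting the conditional mean for each value of $x$ achieves lower empirical squared loss than any perturbed predictor.

```python
import numpy as np

# Sketch: under squared loss, the best predictor of Y given X = x is
# E[Y | X = x]. Here X is binary with E[Y|X=0] = 1 and E[Y|X=1] = 3.
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=100_000)
y = np.where(x == 0, 1.0, 3.0) + rng.normal(size=x.size)

def risk(pred0, pred1):
    # empirical squared loss when predicting pred0 for x=0, pred1 for x=1
    yhat = np.where(x == 0, pred0, pred1)
    return np.mean((y - yhat) ** 2)

# the conditional means should beat perturbed predictions
print(risk(1.0, 3.0), risk(1.5, 3.0), risk(1.0, 2.5))
```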

In practice we do not know the joint distribution; instead we have a sample $Z = \{(x^i, y^i)\}^n_{i=1}$.

error decomposition

$$
\begin{aligned}
E_{x,y}(y - \hat{y}_Z(x))^2 &= E_{x,y}(y - y^{*}(x))^2 + E_x(y^{*}(x) - \hat{y}_Z(x))^2 \\
&= \text{noise} + \text{estimation error}
\end{aligned}
$$

bias-variance decomposition

For a linear estimator:

$$
\begin{aligned}
E_Z E_{x,y}(y - (\hat{y}_Z(x) \coloneqq W^T_Z x))^2
=\ & E_{x,y}(y - y^{*}(x))^2 \quad \text{noise} \\
&+ E_x(y^{*}(x) - E_Z(\hat{y}_Z(x)))^2 \quad \text{bias}^2 \\
&+ E_x E_Z(\hat{y}_Z(x) - E_Z(\hat{y}_Z(x)))^2 \quad \text{variance}
\end{aligned}
$$
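The decomposition can be verified by Monte Carlo. A minimal sketch (the data-generating process and numbers are illustrative): draw many training sets $Z$, fit least squares on each, and check that the estimation error over test inputs splits exactly into bias² plus variance, with the noise $\sigma^2$ added on top.

```python
import numpy as np

# Sketch: Monte Carlo check of the bias-variance decomposition for the
# least-squares estimator y_hat_Z(x) = W_Z^T x. Over repeated draws of Z,
#   E_x E_Z (y*(x) - y_hat_Z(x))^2 = bias^2 + variance.
rng = np.random.default_rng(3)
d, n_train, n_reps = 2, 20, 2000
w_star = np.array([1.0, -1.0])   # y*(x) = x^T w_star
sigma = 0.5                      # noise level

x_test = rng.normal(size=(5000, d))              # fixed test inputs
preds = np.empty((n_reps, len(x_test)))
for r in range(n_reps):
    X = rng.normal(size=(n_train, d))
    y = X @ w_star + rng.normal(scale=sigma, size=n_train)
    w_z = np.linalg.lstsq(X, y, rcond=None)[0]   # fit on this draw of Z
    preds[r] = x_test @ w_z

y_bar = preds.mean(axis=0)                       # E_Z[y_hat_Z(x)] per test x
bias2 = np.mean((x_test @ w_star - y_bar) ** 2)
variance = np.mean((preds - y_bar) ** 2)
est_err = np.mean((preds - x_test @ w_star) ** 2)
print(bias2, variance, est_err)  # bias2 + variance matches est_err
```

Least squares is unbiased here, so the bias² term is near zero and the estimation error is dominated by variance; the split is an exact algebraic identity for the empirical averages.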